diff --git a/notebooks/class_3/week_2/00_intro_to_pandas.ipynb b/notebooks/class_3/week_2/00_intro_to_pandas.ipynb index 005a9243..b6255028 100644 --- a/notebooks/class_3/week_2/00_intro_to_pandas.ipynb +++ b/notebooks/class_3/week_2/00_intro_to_pandas.ipynb @@ -8,10 +8,11 @@ "\n", "In the last class, we discussed how to work with numerical data through arrays and matrices in cases when *all the data were of the same type*. You encounter this often, so `numpy` is a powerful tool for working with such homogeneous data (data of the same type) and in such cases is often the preferred tool if you want to do numeric computations because of its speed. However, there are several frequently encountered situations in which `numpy` tools may not be ideal and where instead the `pandas` library may be best. These include:\n", "\n", - "1. If you do not have purely numerical data (i.e. if your data are heterogeneous), and therefore have a mix of data types, `numpy` may not be appropriate, while `pandas` can easily handle and analyze mixed data types\n", - "2. If you have tabular data, `pandas` makes life much easier to **describe**, **summarize**, **query**, and **visualize** the data than the equivalent processes with `numpy`.\n", + "1. Your data includes not only numerical data, but also text, images, geometric objects, or any other type of data (i.e., your data are heterogeneous). In these situations, `numpy` may not be appropriate, while `pandas` can easily organize, manipulate, and analyze this kid of \"mixed data.\"\n", + "2. If you have tabular data — that is, data organized into rows and columns, as in a spreadsheet — it is often easier in `pandas` to **describe**, **summarize**, **query**, and **visualize** the data than the equivalent processes with `numpy`.\n", "\n", "## Mixed data types\n", + "\n", "Often our data are of mixed types (e.g. integers and strings). This happens all the time. Imagine that you are collecting basic medical information from a patient. You may ask for height and weight (numerical, floating point numbers), age (integer), and blood type (categorical, string). While numpy can store these together in an array, there's not much you're going to be able to do with it computationally. Consider the following example where `numpy` throws an error:" ] }, diff --git a/notebooks/class_3/week_2/37_object_and_categorical_dtypes.ipynb b/notebooks/class_3/week_2/37_object_and_categorical_dtypes.ipynb new file mode 100644 index 00000000..69f1758f --- /dev/null +++ b/notebooks/class_3/week_2/37_object_and_categorical_dtypes.ipynb @@ -0,0 +1,141 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The Object and Categorical Data Types\n", + "\n", + "In our previous readings, we introduced the idea that not only can DataFrames and Series hold any of the numeric data types we've come to know and love from numpy — like `float64` or `int64` — but that they can also hold arbitrary Python objects in an `object`-type Series. Moreover, we also acknowledged (though didn't go into) the existence of another relatively unique pandas data type — the `Categorical` Series. \n", + "\n", + "In this reading, we will discuss the purpose of these two data types, as well as their advantages and disadvantages.\n", + "\n", + "## The object dtype\n", + "\n", + "The `object` type Series gives pandas _incredible_ flexibility as it allows any type of data to be stored in a table. The most common use of the `object` data type is to store text — for example, names, addresses, written answers, etc. — but the flexibility can also be used for applications like geospatial analysis, in which a single row of a DataFrame may represent a single country, the columns may represent features of the country (name, income, population), and the last column stores geometric Python objects that describe the shape and location of the country. \n", + "\n", + "But this flexibility also comes at a cost — performance and memory efficiency.\n", + "\n", + "### The object Performance Penalty" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To understand why `object` Series are slow, it helps to first discuss why numeric Series (and numeric numpy arrays) are fast. \n", + "When you work with a numeric pandas Series or numpy array, all of the entries in those arrays live in the same place in memory (in your computer's RAM). This is possible because all those integers (or all those floating point numbers) are written with the same number of 0s and 1s. In an `int64` Series, for example, every integer is represented by 64 1s and 0s. This makes it easy for the computer to lay them out sequentially. Moreover, it makes it possible for the computer to find specific entries quickly, since it knows that the third integer in the Series will start 64 * 2 spots from the start of the Series and end 64 spots later.\n", + "\n", + "But an *object* Series is a little different. Python objects vary in size — some may only take up 128 0s and 1s, while others may require thousands — and so the actual data in an object Series can't be laid down in a nice regularly spaced sequence. Instead, every entry in an object Series gets put in a different location in your computer's memory (RAM), and only the *address* of that information is placed in a nice organized Series. These addresses are all the same size, and so the addresses can be organized in a regular manner, even if the actual content you want to store is irregular.\n", + "\n", + "The cost of this arrangement is that if you ask for the second entry of an object Series (e.g., `my_series.iloc[1]`), you're computer has first go to the second location in the array, read the address stored there, *then go to that address* to find the actual content you want. And those added steps waste time.\n", + "\n", + "The other problem with object Series is that because they can store anything, Python doesn't know before it looks up an entry whether it will find a string, an integer, or a Python set. As a result, when it sees code like: \n", + "\n", + "```python \n", + "my_array * 2\n", + "```\n", + "\n", + "Python can't be sure what is meant by `*` — it could mean \"do integer multiplication\" (if a given entry in `my_array` is an integer), but it might also mean \"double up the list you find\" (if the entry is a Python list)!\n", + "\n", + "Indeed we can see this if we make a numpy object array full of ints and compare it to a numpy integer array. The both have the same content, but they are organized in memory differently:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "object_numbers = pd.Series(np.arange(1000000), dtype=\"object\")\n", + "numbers = pd.Series(np.arange(1000000), dtype=\"int64\")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "16.1 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" + ] + } + ], + "source": [ + "%timeit object_numbers * 2" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "743 µs ± 7.57 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" + ] + } + ], + "source": [ + "%timeit numbers * 2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "See? Same operation (doubling each entry of arrays with the integers from 0 to 1,000,000), but the object array operation is ~20x slower. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So yes, `object` dtypes are absolutely wonderful and introduce unbelievable flexibility to pandas; but remember there is a cost to using them, so stick to numeric data types when possible!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Categoricals" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Categoricals are a terrific little hack." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}