{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "8096b98a", "metadata": {}, "source": [ "Here we are giving a brief introduction in working with IAMC-styled data with pandas and pandas-indexing." ] }, { "cell_type": "code", "execution_count": null, "id": "bee121f7", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "attachments": {}, "cell_type": "markdown", "id": "814f4bb8-be13-47cd-9853-eb23f61a1da3", "metadata": {}, "source": [ "# Test data set\n", "\n", "For experimenting and easy testing `pandas-indexing` brings along the power sector generation and capacity of the HighRE illustrative modelling pathway from the IPCC AR6 scenario database in IAMC format." ] }, { "cell_type": "code", "execution_count": null, "id": "48012a50-1881-4c41-abad-4062828331a5", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing.datasets import remindhighre_power\n", "\n", "\n", "df = remindhighre_power()\n", "df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0d61c065", "metadata": {}, "source": [ "# Usage styles\n", "\n", "`pandas-indexing` defines two different usage styles:\n", "\n", "1. functions that can be imported from the toplevel module, like\n", " \n", " ```python\n", " from pandas_indexing import assignlevel\n", " assignlevel(df, unit=\"Mt CO2e/yr\")\n", " ```\n", "2. convenience accessors that are hooking into pandas as extensions, like\n", " \n", " ```python\n", " df.pix.assign(unit=\"Mt CO2e/yr)\n", " ```\n", "\n", "Most of the functionality is available with both styles under slightly different names. I'll present the functional style here first (and add the alternative as comments)" ] }, { "cell_type": "code", "execution_count": null, "id": "e7586672", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing.core import describelevel\n", "\n", "\n", "describelevel(df) # or: df.pix" ] }, { "cell_type": "code", "execution_count": null, "id": "a4f23678", "metadata": {}, "outputs": [], "source": [ "df.pix" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5b948e75", "metadata": {}, "source": [ "As one can see the IAMC format is defined by five index levels: `model`, `scenario`, `variable`, `unit` and `region`. In this data subset, we have a single `model`, `scenario` combination for one `region` and with several capacity `variable`s starting with `Capacity|Electricity|` and generation variables starting with `Secondary Energy|Electricity|`.\n", "\n", "The data comes with two different units: `GW` and `GWh/yr`, (hopefully) for capacity and generation, respectively." ] }, { "attachments": {}, "cell_type": "markdown", "id": "61d70ee2", "metadata": {}, "source": [ "# Selecting data\n", "\n", "For using pandas indexes effectively for computations, it makes sense to split the hierarchically variable index out into separate python variables: `generation` and `capacity`. The standard pandas tools for this job are `pd.DataFrame.loc` in conjunction with `pd.IndexSlice` or `pd.DataFrame.query`. \n", "\n", "`pandas_indexing` brings `ismatch` and `isin` to make this job as easy as possible." ] }, { "cell_type": "code", "execution_count": null, "id": "36d58553", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import isin, ismatch # no .idx equivalents" ] }, { "cell_type": "code", "execution_count": null, "id": "982b9663", "metadata": {}, "outputs": [], "source": [ "df.loc[ismatch(variable=\"Capacity|**\"), 2030]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5f0dfdd8", "metadata": {}, "source": [ "`ismatch` allows using a glob-like pattern to subset into one or multiple named levels, together with the standard `rename` method we can get cleaned up capacity and generation data easily:" ] }, { "cell_type": "code", "execution_count": null, "id": "711f2683", "metadata": {}, "outputs": [], "source": [ "generation = df.loc[ismatch(variable=\"Secondary Energy|**\")].rename(\n", " index=lambda s: s.removeprefix(\"Secondary Energy|Electricity|\")\n", ")\n", "generation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3087eb3c", "metadata": {}, "source": [ "Since this extraction of data is relatively common, `extractlevel` simplifies this by matching against a format-like template string:" ] }, { "cell_type": "code", "execution_count": null, "id": "16032dd2", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import extractlevel, formatlevel\n", "\n", "\n", "generation = extractlevel(df, variable=\"Secondary Energy|{carrier}|{fuel}\", drop=True)\n", "capacity = extractlevel(df, variable=\"Capacity|{carrier}|{fuel}\", drop=True)\n", "# or: df.pix.extract(variable=\"Secondary Energy|{carrier}|{fuel}\")\n", "generation" ] }, { "attachments": {}, "cell_type": "markdown", "id": "405720c7", "metadata": {}, "source": [ "The inverse operation is to combine strings back together with `formatlevel`:" ] }, { "cell_type": "code", "execution_count": null, "id": "8e355684", "metadata": {}, "outputs": [], "source": [ "formatlevel(generation, variable=\"Secondary Energy|{carrier}|{fuel}\", drop=True)\n", "# or: df.pix.format(variable=\"Secondary Energy|{carrier}|{fuel}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e78ace7d", "metadata": {}, "source": [ "With `generation` and `capacity` conveniently split into separate variables, we can calculate capacity factors (ratios of generation and capacity) directly, as long as we take care of removing the conflicting `unit` level. Similarly to `ismatch`, `isin` can be provided as an argument to `.loc[]` to select on named index levels with the difference that only exact matches are considered." ] }, { "cell_type": "code", "execution_count": null, "id": "9b4be2b0", "metadata": {}, "outputs": [], "source": [ "capacity_factor = generation.droplevel(\"unit\") / 8760 / capacity.droplevel(\"unit\")\n", "capacity_factor.loc[isin(fuel=[\"Solar\", \"Wind\", \"Hydro\", \"Geothermal\"]), 2030:2051]" ] }, { "cell_type": "markdown", "id": "d62fabd9", "metadata": {}, "source": [ "Instead of dropping the `unit` level, there is also a set of unit-aware calculation functions, so that this full capacity factor calculation can be performed in very few steps (the unit aware calculation realizes correctly that the capacity factor is unit-less):" ] }, { "cell_type": "code", "execution_count": null, "id": "94c1f308", "metadata": {}, "outputs": [], "source": [ "generation = extractlevel(df, variable=\"Secondary Energy|{carrier}|{fuel}\", drop=True)\n", "capacity = extractlevel(df, variable=\"Capacity|{carrier}|{fuel}\", drop=True)\n", "generation.pix.unitdiv(capacity)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b4d501a3", "metadata": {}, "source": [ "Under the hood `isin` and `ismatch` generate `Selector` objects. They can be composed into complex queries intuitively, which are kept as a hierarchical structure of objects." ] }, { "cell_type": "code", "execution_count": null, "id": "4197be7d", "metadata": {}, "outputs": [], "source": [ "query = isin(fuel=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\") & ~ismatch(fuel=\"S*\")\n", "query" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f137127f", "metadata": {}, "source": [ "For evaluating such a query one needs to pass in a data object to produce a boolean mask. Since pandas `.loc` indexer does exactly that, these queries work as expected.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "18401aa0", "metadata": {}, "outputs": [], "source": [ "query(capacity)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7e1d4f00", "metadata": {}, "source": [ "````{note}\n", "It is only possible from version 0.5.2 to use a pandas boolean series **in front of** a selector; ie.\n", "```python\n", "(capacity[2030] > 250) & isin(variable=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\")\n", "```\n", "works, as you would expect it, in the same way as\n", "```python\n", "isin(variable=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\") & (capacity[2030] > 250)\n", "```\n", "````" ] }, { "cell_type": "code", "execution_count": null, "id": "8b421dba", "metadata": {}, "outputs": [], "source": [ "high_capacity_fossil = capacity.loc[\n", " isin(fuel=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\") & (capacity[2030] > 250),\n", " :2041,\n", "]\n", "high_capacity_fossil" ] }, { "attachments": {}, "cell_type": "markdown", "id": "50fe70b7", "metadata": {}, "source": [ "The simple fact that this is an operation on `[]`, means that we can also use it to modify values in-place:" ] }, { "cell_type": "code", "execution_count": null, "id": "71d7aa7d", "metadata": {}, "outputs": [], "source": [ "high_capacity_fossil.loc[isin(fuel=\"Gas\"), 2030:] = 1000.0\n", "high_capacity_fossil" ] }, { "attachments": {}, "cell_type": "markdown", "id": "25dd3f6e", "metadata": {}, "source": [ "Most methods in `pandas_indexing` do not care whether they are run on an index, a series or a dataframe, but will transiently take care of handing them down to the appropriate level:" ] }, { "cell_type": "code", "execution_count": null, "id": "e0ff32d5", "metadata": {}, "outputs": [], "source": [ "fossil_series = (\n", " capacity.loc[isin(fuel=[\"Coal\", \"Gas\", \"Nuclear\"]), [2030, 2040, 2050, 2060]]\n", " .rename_axis(columns=\"year\")\n", " .stack()\n", ")\n", "fossil_series" ] }, { "cell_type": "code", "execution_count": null, "id": "6d4686e6", "metadata": {}, "outputs": [], "source": [ "fossil_series.loc[isin(year=[2030, 2050])]" ] }, { "cell_type": "code", "execution_count": null, "id": "8d4e91fb", "metadata": {}, "outputs": [], "source": [ "isin(fossil_series.index, fuel=\"Nuclear\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "36c41351", "metadata": {}, "source": [ "# Selecting based on a multi-index\n", "\n", "If we need pairs of data like `Coal` in 2030 and `Gas` in 2035 and `Nuclear` in 2040 and 2050, then we can pass a multiindex to `isin`:" ] }, { "cell_type": "code", "execution_count": null, "id": "cde4ff9f", "metadata": {}, "outputs": [], "source": [ "idx = pd.MultiIndex.from_tuples(\n", " [(\"Coal\", 2030), (\"Gas\", 2035), (\"Nuclear\", 2040), (\"Nuclear\", 2050)],\n", " names=[\"fuel\", \"year\"],\n", ")\n", "idx" ] }, { "cell_type": "code", "execution_count": null, "id": "7d4edea9", "metadata": {}, "outputs": [], "source": [ "fossil_series.loc[isin(idx)]" ] }, { "cell_type": "markdown", "id": "8bce1bf9", "metadata": {}, "source": [ "Since `(\"Gas\", 2035)` is not part of the original `fossil_series` it is silently ignored, just like with other uses of `isin`.\n", "\n", "Alternatively, the same result can be retrieved with the more powerful `semijoin` using an `\"inner\"` join:" ] }, { "cell_type": "code", "execution_count": null, "id": "dc294f2f", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import semijoin\n", "\n", "\n", "semijoin(\n", " fossil_series, idx, how=\"inner\"\n", ") # or: fossil_series.pix.semijoin(idx, how=\"inner\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cbaa73b0", "metadata": {}, "source": [ "\n", "A `\"right\"`-join on the other hand will follow the order and keep all elements of the provided `idx`. Since `(\"Gas\", 2035)` is not part of the original `fossil_series` it shows up as `NaN`s here:" ] }, { "cell_type": "code", "execution_count": null, "id": "f82d6bf4", "metadata": {}, "outputs": [], "source": [ "semijoin(fossil_series, idx, how=\"right\")" ] }, { "cell_type": "markdown", "id": "4c420dc1", "metadata": {}, "source": [ "It is also possible to get the inverted result, with only the not matching rows, with an `antijoin`" ] }, { "cell_type": "code", "execution_count": null, "id": "bed99ac9", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import antijoin\n", "\n", "\n", "antijoin(fossil_series, idx)\n", "# or: fossil_series.pix.antijoin(idx)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f2e1e0e4", "metadata": {}, "source": [ "# Projecting levels\n", "\n", "Often after selecting the right subsets, ie the interesting `model` or `scenario` it makes sense to consolidate the data to a given set of `levels`. That is what `projectlevel` is used for:" ] }, { "cell_type": "code", "execution_count": null, "id": "6d0000aa", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import projectlevel\n", "\n", "\n", "simple_fossil_series = projectlevel(fossil_series, [\"fuel\", \"year\"])\n", "# or: fossil_series.pix.project([\"fuel\", \"year\"])\n", "simple_fossil_series" ] }, { "attachments": {}, "cell_type": "markdown", "id": "715fe070", "metadata": {}, "source": [ "`projectlevel` reduces the levels attached to a multiindex to the ones explicitly named. It is basically the complement to `droplevel` which removes the listed names" ] }, { "cell_type": "code", "execution_count": null, "id": "308fb920", "metadata": {}, "outputs": [], "source": [ "projectlevel(fossil_series, [\"model\", \"scenario\"]) == fossil_series.droplevel(\n", " [\"carrier\", \"fuel\", \"unit\", \"region\", \"year\"]\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6eb7b029", "metadata": {}, "source": [ "# Assigning to levels\n", "\n", "`assignlevel` allows to modify individual values with helpful keyword arguments," ] }, { "cell_type": "code", "execution_count": null, "id": "ca2a3c70", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import assignlevel\n", "\n", "\n", "assignlevel(df, variable=\"Updated|\" + projectlevel(df.index, \"variable\"), unit=\"bla\")\n", "# or: df.pix.assign(variable=\"Updated|\" + df.index.pix.project(\"variable\"), unit=\"bla\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1349c89d", "metadata": {}, "source": [ "This particular case is even more clearly handled with `formatlevel`:" ] }, { "cell_type": "code", "execution_count": null, "id": "f4cb55c2", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import formatlevel\n", "\n", "\n", "formatlevel(df, variable=\"Updated|{variable}\", unit=\"bla\")\n", "# or: df.pix.format(variable=...)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a8416465", "metadata": {}, "source": [ "Both functions avoid having to rely on `reset_index`, `set_index` pairs, which are painful for large data, since `set_index` is expensive!" ] }, { "cell_type": "code", "execution_count": null, "id": "72d60d5e", "metadata": {}, "outputs": [], "source": [ "df.reset_index().assign(variable=\"Capacity\").set_index(df.index.names)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "63103fb7", "metadata": {}, "source": [ "# Examining level values and level combinations\n", "\n", "We already encountered the possibility to get an overview of the available levels and their values with describelevel:" ] }, { "cell_type": "code", "execution_count": null, "id": "029603e9", "metadata": {}, "outputs": [], "source": [ "describelevel(df) # or: df.idx" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5938866f", "metadata": {}, "source": [ "Often it is necessary to get programmatic access to the unique values of one or more levels:" ] }, { "cell_type": "code", "execution_count": null, "id": "f9fe7174", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import uniquelevel\n", "\n", "\n", "uniquelevel(df, \"variable\")\n", "# or: df.pix.unique(\"variable\")\n", "# or in vanilla pandas: df.index.unique(\"variable\")" ] }, { "cell_type": "code", "execution_count": null, "id": "23378320", "metadata": {}, "outputs": [], "source": [ "uniquelevel(df, [\"variable\", \"unit\"])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c84e2a30-1880-4687-b450-6c5b4dac3c01", "metadata": {}, "source": [ "# BEWARE: Pitfalls" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ec137242", "metadata": {}, "source": [ "`concat` ignores level order, so make sure to `reorder_levels` them" ] }, { "cell_type": "code", "execution_count": null, "id": "24d42af8-ffd3-4281-9d75-1449046fd6fd", "metadata": {}, "outputs": [], "source": [ "pd.concat([simple_fossil_series, simple_fossil_series.swaplevel()])" ] }, { "cell_type": "code", "execution_count": null, "id": "254197c5-8498-4c69-a30d-044e7edb7b53", "metadata": {}, "outputs": [], "source": [ "pd.concat(\n", " [\n", " simple_fossil_series,\n", " simple_fossil_series.swaplevel().reorder_levels(\n", " simple_fossil_series.index.names\n", " ),\n", " ]\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ac53aa0b", "metadata": {}, "source": [ "Therefore, `pandas-indexing` brings a variant which does this automatically:" ] }, { "cell_type": "code", "execution_count": null, "id": "ef1d54c6", "metadata": {}, "outputs": [], "source": [ "from pandas_indexing import concat\n", "\n", "\n", "concat([simple_fossil_series, simple_fossil_series.swaplevel()])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c91c2ef3", "metadata": {}, "source": [ "# Additional helpful multi-index helpers" ] }, { "attachments": {}, "cell_type": "markdown", "id": "46c57e84", "metadata": {}, "source": [ "MultiIndex rendering is often annoying to read, since the important information might get abbreviated away, then converting it into a dataframe is helpful" ] }, { "cell_type": "code", "execution_count": null, "id": "5b7cb7db", "metadata": {}, "outputs": [], "source": [ "projectlevel(fossil_series.index, [\"model\", \"scenario\", \"fuel\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "5b7cb7db", "metadata": {}, "outputs": [], "source": [ "projectlevel(fossil_series.index, [\"model\", \"scenario\", \"fuel\"]).to_frame(index=False)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.2" } }, "nbformat": 4, "nbformat_minor": 5 }