{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8096b98a",
   "metadata": {},
   "source": [
    "Here we are giving a brief introduction in working with IAMC-styled data with pandas and pandas-indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bee121f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "814f4bb8-be13-47cd-9853-eb23f61a1da3",
   "metadata": {},
   "source": [
    "# Test data set\n",
    "\n",
    "For experimenting and easy testing `pandas-indexing` brings along the power sector generation and capacity of the HighRE illustrative modelling pathway from the IPCC AR6 scenario database in IAMC format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48012a50-1881-4c41-abad-4062828331a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing.datasets import remindhighre_power\n",
    "\n",
    "\n",
    "df = remindhighre_power()\n",
    "df.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0d61c065",
   "metadata": {},
   "source": [
    "# Usage styles\n",
    "\n",
    "`pandas-indexing` defines two different usage styles:\n",
    "\n",
    "1. functions that can be imported from the toplevel module, like\n",
    "   \n",
    "   ```python\n",
    "   from pandas_indexing import assignlevel\n",
    "   assignlevel(df, unit=\"Mt CO2e/yr\")\n",
    "   ```\n",
    "2. convenience accessors that are hooking into pandas as extensions, like\n",
    "   \n",
    "   ```python\n",
    "   df.pix.assign(unit=\"Mt CO2e/yr)\n",
    "   ```\n",
    "\n",
    "Most of the functionality is available with both styles under slightly different names. I'll present the functional style here first (and add the alternative as comments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7586672",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing.core import describelevel\n",
    "\n",
    "\n",
    "describelevel(df)  # or: df.pix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4f23678",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.pix"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5b948e75",
   "metadata": {},
   "source": [
    "As one can see the IAMC format is defined by five index levels: `model`, `scenario`, `variable`, `unit` and `region`. In this data subset, we have a single `model`, `scenario` combination for one `region` and with several capacity `variable`s starting with `Capacity|Electricity|` and generation variables starting with `Secondary Energy|Electricity|`.\n",
    "\n",
    "The data comes with two different units: `GW` and `GWh/yr`, (hopefully) for capacity and generation, respectively."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "61d70ee2",
   "metadata": {},
   "source": [
    "# Selecting data\n",
    "\n",
    "For using pandas indexes effectively for computations, it makes sense to split the hierarchically variable index out into separate python variables: `generation` and `capacity`. The standard pandas tools for this job are `pd.DataFrame.loc` in conjunction with `pd.IndexSlice` or `pd.DataFrame.query`. \n",
    "\n",
    "`pandas_indexing` brings `ismatch` and `isin` to make this job as easy as possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36d58553",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import isin, ismatch  # no .idx equivalents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "982b9663",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.loc[ismatch(variable=\"Capacity|**\"), 2030]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5f0dfdd8",
   "metadata": {},
   "source": [
    "`ismatch` allows using a glob-like pattern to subset into one or multiple named levels, together with the standard `rename` method we can get cleaned up capacity and generation data easily:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "711f2683",
   "metadata": {},
   "outputs": [],
   "source": [
    "generation = df.loc[ismatch(variable=\"Secondary Energy|**\")].rename(\n",
    "    index=lambda s: s.removeprefix(\"Secondary Energy|Electricity|\")\n",
    ")\n",
    "generation"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3087eb3c",
   "metadata": {},
   "source": [
    "Since this extraction of data is relatively common, `extractlevel` simplifies this by matching against a format-like template string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16032dd2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import extractlevel, formatlevel\n",
    "\n",
    "\n",
    "generation = extractlevel(df, variable=\"Secondary Energy|{carrier}|{fuel}\", drop=True)\n",
    "capacity = extractlevel(df, variable=\"Capacity|{carrier}|{fuel}\", drop=True)\n",
    "# or: df.pix.extract(variable=\"Secondary Energy|{carrier}|{fuel}\")\n",
    "generation"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "405720c7",
   "metadata": {},
   "source": [
    "The inverse operation is to combine strings back together with `formatlevel`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e355684",
   "metadata": {},
   "outputs": [],
   "source": [
    "formatlevel(generation, variable=\"Secondary Energy|{carrier}|{fuel}\", drop=True)\n",
    "# or: df.pix.format(variable=\"Secondary Energy|{carrier}|{fuel}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e78ace7d",
   "metadata": {},
   "source": [
    "With `generation` and `capacity` conveniently split into separate variables, we can calculate capacity factors (ratios of generation and capacity) directly, as long as we take care of removing the conflicting `unit` level. Similarly to `ismatch`, `isin` can be provided as an argument to `.loc[]` to select on named index levels with the difference that only exact matches are considered."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b4be2b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "capacity_factor = generation.droplevel(\"unit\") / 8760 / capacity.droplevel(\"unit\")\n",
    "capacity_factor.loc[isin(fuel=[\"Solar\", \"Wind\", \"Hydro\", \"Geothermal\"]), 2030:2051]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d62fabd9",
   "metadata": {},
   "source": [
    "Instead of dropping the `unit` level, there is also a set of unit-aware calculation functions, so that this full capacity factor calculation can be performed in very few steps (the unit aware calculation realizes correctly that the capacity factor is unit-less):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94c1f308",
   "metadata": {},
   "outputs": [],
   "source": [
    "generation = extractlevel(df, variable=\"Secondary Energy|{carrier}|{fuel}\", drop=True)\n",
    "capacity = extractlevel(df, variable=\"Capacity|{carrier}|{fuel}\", drop=True)\n",
    "generation.pix.unitdiv(capacity)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b4d501a3",
   "metadata": {},
   "source": [
    "Under the hood `isin` and `ismatch` generate `Selector` objects. They can be composed into complex queries intuitively, which are kept as a hierarchical structure of objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4197be7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = isin(fuel=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\") & ~ismatch(fuel=\"S*\")\n",
    "query"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f137127f",
   "metadata": {},
   "source": [
    "For evaluating such a query one needs to pass in a data object to produce a boolean mask. Since pandas `.loc` indexer does exactly that, these queries work as expected.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18401aa0",
   "metadata": {},
   "outputs": [],
   "source": [
    "query(capacity)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7e1d4f00",
   "metadata": {},
   "source": [
    "````{note}\n",
    "It is only possible from version 0.5.2 to use a pandas boolean series **in front of** a selector; ie.\n",
    "```python\n",
    "(capacity[2030] > 250) & isin(variable=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\")\n",
    "```\n",
    "works, as you would expect it, in the same way as\n",
    "```python\n",
    "isin(variable=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\") & (capacity[2030] > 250)\n",
    "```\n",
    "````"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b421dba",
   "metadata": {},
   "outputs": [],
   "source": [
    "high_capacity_fossil = capacity.loc[\n",
    "    isin(fuel=[\"Coal\", \"Gas\", \"Nuclear\"], unit=\"GW\") & (capacity[2030] > 250),\n",
    "    :2041,\n",
    "]\n",
    "high_capacity_fossil"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "50fe70b7",
   "metadata": {},
   "source": [
    "The simple fact that this is an operation on `[]`, means that we can also use it to modify values in-place:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71d7aa7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "high_capacity_fossil.loc[isin(fuel=\"Gas\"), 2030:] = 1000.0\n",
    "high_capacity_fossil"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "25dd3f6e",
   "metadata": {},
   "source": [
    "Most methods in `pandas_indexing` do not care whether they are run on an index, a series or a dataframe, but will transiently take care of handing them down to the appropriate level:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0ff32d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "fossil_series = (\n",
    "    capacity.loc[isin(fuel=[\"Coal\", \"Gas\", \"Nuclear\"]), [2030, 2040, 2050, 2060]]\n",
    "    .rename_axis(columns=\"year\")\n",
    "    .stack()\n",
    ")\n",
    "fossil_series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d4686e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "fossil_series.loc[isin(year=[2030, 2050])]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d4e91fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "isin(fossil_series.index, fuel=\"Nuclear\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "36c41351",
   "metadata": {},
   "source": [
    "# Selecting based on a multi-index\n",
    "\n",
    "If we need pairs of data like `Coal` in 2030 and `Gas` in 2035 and `Nuclear` in 2040 and 2050, then we can pass a multiindex to `isin`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cde4ff9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "idx = pd.MultiIndex.from_tuples(\n",
    "    [(\"Coal\", 2030), (\"Gas\", 2035), (\"Nuclear\", 2040), (\"Nuclear\", 2050)],\n",
    "    names=[\"fuel\", \"year\"],\n",
    ")\n",
    "idx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d4edea9",
   "metadata": {},
   "outputs": [],
   "source": [
    "fossil_series.loc[isin(idx)]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bce1bf9",
   "metadata": {},
   "source": [
    "Since `(\"Gas\", 2035)` is not part of the original `fossil_series` it is silently ignored, just like with other uses of `isin`.\n",
    "\n",
    "Alternatively, the same result can be retrieved with the more powerful `semijoin` using an `\"inner\"` join:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc294f2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import semijoin\n",
    "\n",
    "\n",
    "semijoin(\n",
    "    fossil_series, idx, how=\"inner\"\n",
    ")  # or: fossil_series.pix.semijoin(idx, how=\"inner\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cbaa73b0",
   "metadata": {},
   "source": [
    "\n",
    "A `\"right\"`-join on the other hand will follow the order and keep all elements of the provided `idx`. Since `(\"Gas\", 2035)` is not part of the original `fossil_series` it shows up as `NaN`s here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f82d6bf4",
   "metadata": {},
   "outputs": [],
   "source": [
    "semijoin(fossil_series, idx, how=\"right\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c420dc1",
   "metadata": {},
   "source": [
    "It is also possible to get the inverted result, with only the not matching rows, with an `antijoin`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bed99ac9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import antijoin\n",
    "\n",
    "\n",
    "antijoin(fossil_series, idx)\n",
    "# or: fossil_series.pix.antijoin(idx)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f2e1e0e4",
   "metadata": {},
   "source": [
    "# Projecting levels\n",
    "\n",
    "Often after selecting the right subsets, ie the interesting `model` or `scenario` it makes sense to consolidate the data to a given set of `levels`. That is what `projectlevel` is used for:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d0000aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import projectlevel\n",
    "\n",
    "\n",
    "simple_fossil_series = projectlevel(fossil_series, [\"fuel\", \"year\"])\n",
    "# or: fossil_series.pix.project([\"fuel\", \"year\"])\n",
    "simple_fossil_series"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "715fe070",
   "metadata": {},
   "source": [
    "`projectlevel` reduces the levels attached to a multiindex to the ones explicitly named. It is basically the complement to `droplevel` which removes the listed names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "308fb920",
   "metadata": {},
   "outputs": [],
   "source": [
    "projectlevel(fossil_series, [\"model\", \"scenario\"]) == fossil_series.droplevel(\n",
    "    [\"carrier\", \"fuel\", \"unit\", \"region\", \"year\"]\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6eb7b029",
   "metadata": {},
   "source": [
    "# Assigning to levels\n",
    "\n",
    "`assignlevel` allows to modify individual values with helpful keyword arguments,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca2a3c70",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import assignlevel\n",
    "\n",
    "\n",
    "assignlevel(df, variable=\"Updated|\" + projectlevel(df.index, \"variable\"), unit=\"bla\")\n",
    "# or: df.pix.assign(variable=\"Updated|\" + df.index.pix.project(\"variable\"), unit=\"bla\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1349c89d",
   "metadata": {},
   "source": [
    "This particular case is even more clearly handled with `formatlevel`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4cb55c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import formatlevel\n",
    "\n",
    "\n",
    "formatlevel(df, variable=\"Updated|{variable}\", unit=\"bla\")\n",
    "# or: df.pix.format(variable=...)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a8416465",
   "metadata": {},
   "source": [
    "Both functions avoid having to rely on `reset_index`, `set_index` pairs, which are painful for large data, since `set_index` is expensive!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72d60d5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.reset_index().assign(variable=\"Capacity\").set_index(df.index.names)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "63103fb7",
   "metadata": {},
   "source": [
    "# Examining level values and level combinations\n",
    "\n",
    "We already encountered the possibility to get an overview of the available levels and their values with describelevel:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "029603e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "describelevel(df)  # or: df.idx"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5938866f",
   "metadata": {},
   "source": [
    "Often it is necessary to get programmatic access to the unique values of one or more levels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9fe7174",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import uniquelevel\n",
    "\n",
    "\n",
    "uniquelevel(df, \"variable\")\n",
    "# or: df.pix.unique(\"variable\")\n",
    "# or in vanilla pandas: df.index.unique(\"variable\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23378320",
   "metadata": {},
   "outputs": [],
   "source": [
    "uniquelevel(df, [\"variable\", \"unit\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c84e2a30-1880-4687-b450-6c5b4dac3c01",
   "metadata": {},
   "source": [
    "# BEWARE: Pitfalls"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ec137242",
   "metadata": {},
   "source": [
    "`concat` ignores level order, so make sure to `reorder_levels` them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24d42af8-ffd3-4281-9d75-1449046fd6fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.concat([simple_fossil_series, simple_fossil_series.swaplevel()])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "254197c5-8498-4c69-a30d-044e7edb7b53",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.concat(\n",
    "    [\n",
    "        simple_fossil_series,\n",
    "        simple_fossil_series.swaplevel().reorder_levels(\n",
    "            simple_fossil_series.index.names\n",
    "        ),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ac53aa0b",
   "metadata": {},
   "source": [
    "Therefore, `pandas-indexing` brings a variant which does this automatically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef1d54c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas_indexing import concat\n",
    "\n",
    "\n",
    "concat([simple_fossil_series, simple_fossil_series.swaplevel()])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c91c2ef3",
   "metadata": {},
   "source": [
    "# Additional helpful multi-index helpers"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "46c57e84",
   "metadata": {},
   "source": [
    "MultiIndex rendering is often annoying to read, since the important information might get abbreviated away, then converting it into a dataframe is helpful"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b7cb7db",
   "metadata": {},
   "outputs": [],
   "source": [
    "projectlevel(fossil_series.index, [\"model\", \"scenario\", \"fuel\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b7cb7db",
   "metadata": {},
   "outputs": [],
   "source": [
    "projectlevel(fossil_series.index, [\"model\", \"scenario\", \"fuel\"]).to_frame(index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}