{ "cells": [ { "cell_type": "markdown", "id": "634a6ccd", "metadata": {}, "source": [ "
\n", "Licence CC BY-NC-ND\n", "Valérie Roy\n", "
\n", "\n", "" ] }, { "cell_type": "markdown", "id": "52dbc43f", "metadata": {}, "source": [ "# regroupement de données `pandas.DataFrame.groupby`\n", "\n", "Une table de données `pandas` est en 2 dimensions mais elle peut indiquer des sous-divisions de vos données. Par exemple, les passagers du Titanic sont divisés en femmes et hommes, en passagers de première, deuxième et troisiéme classe. On pourrait même les diviser en classe d'âge, enfants, jeunes, adultes...\n", "\n", "Des analyses mettant en exergue ces différents groupes de personnes peuvent être intéressantes. Lors du naufrage du Titanic, valait-il mieux être une femme en première classe ou un enfant en troisième ?" ] }, { "cell_type": "markdown", "id": "3850d4be", "metadata": {}, "source": [ "On reprend nos données" ] }, { "cell_type": "code", "execution_count": 1, "id": "e724fa34", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "50c24850", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file = 'titanic.csv'\n", "df = pd.read_csv(file, index_col=0)\n", "df.head(3)" ] }, { "cell_type": "markdown", "id": "cca43755", "metadata": {}, "source": [ "On va calculer des regroupements en utilisant la fonction `pandas.DataFrame.groupby`, à laquelle on indique un ou plusieurs critères." ] }, { "cell_type": "markdown", "id": "02f04a33", "metadata": {}, "source": [ "## groupement par critère unique" ] }, { "cell_type": "markdown", "id": "8b61158d", "metadata": {}, "source": [ "Par exemple, prenons le seul critère de genre des passagers (`Sex`). Cette colonne a deux valeurs: `female` et `male`, nous allons donc obtenir une partition de notre dataframe en deux dataframes : celle des hommes et celle des femmes, sur lesquelles nous allons pouvoir faire des analyses (moyenne...) différencées par genre.\n", "\n", "Allons-y:" ] }, { "cell_type": "code", "execution_count": 3, "id": "4368eba3", "metadata": {}, "outputs": [], "source": [ "df_by_sex = df.groupby('Sex')" ] }, { "cell_type": "markdown", "id": "bc737d1a", "metadata": {}, "source": [ "`pandas` calcule les différentes valeurs de la colonne en question (ici `Sex`), et partitionne la dataframe en autant de dataframes que de valeurs différentes." ] }, { "cell_type": "markdown", "id": "01596a65", "metadata": {}, "source": [ "`pandas` met les regroupements dans un objet de type `DataFrameGroupBy` (ici de nom `df_by_sex`) qui vous donne accès à de nombreuses fonctionnalités (regardez le help pour plus de détails), nous allons voir ici très peu de choses ici." ] }, { "cell_type": "markdown", "id": "3f89ff61", "metadata": {}, "source": [ "### les tailles des groupes" ] }, { "cell_type": "markdown", "id": "dad111f5", "metadata": {}, "source": [ "Les objets de type `DataFrameGroupBy` contiennent une fonction très pratique pour récapituler les groupes : `size`." ] }, { "cell_type": "code", "execution_count": 4, "id": "456a9d8a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sex\n", "female 314\n", "male 577\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_by_sex.size()" ] }, { "cell_type": "markdown", "id": "0f764380", "metadata": {}, "source": [ "### accéder aux sous-dataframes" ] }, { "cell_type": "markdown", "id": "22059550", "metadata": {}, "source": [ "On peut aussi itérer sur un objet de type `DataFrameGroupBy` afin de voir les différentes dataframes." ] }, { "cell_type": "code", "execution_count": 5, "id": "fcaef3d5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "La dataframe de clé 'female' a 314 éléments sur les 891\n", "La dataframe de clé 'male' a 577 éléments sur les 891\n" ] } ], "source": [ "for name, subdf in df_by_sex:\n", " print(f\"La dataframe de clé '{name}' a {subdf.shape[0]} éléments sur les {df.shape[0]}\")" ] }, { "cell_type": "markdown", "id": "5c32fb2e", "metadata": {}, "source": [ "Voilà, la fonction a bien partitionné votre dataframe en autant de dataframes que de genre des personnes." ] }, { "cell_type": "markdown", "id": "3e8bf168", "metadata": {}, "source": [ "### les groupes (dictionnaire d'index par clés)" ] }, { "cell_type": "markdown", "id": "9b2095dc", "metadata": {}, "source": [ "Vous pouvez accéder, au travers du champ `groups` des objets de type `DataFrameGroupBy`, au dictionnaire vous donnant les groupes d'`Index` (ici deux parce `male` et `female`).\n", "\n", "Chaque clé est une des valeurs possibles (donc à nouveau `male` et `female`), et sa valeur est la liste des index des lignes ayant cette valeur:" ] }, { "cell_type": "code", "execution_count": 6, "id": "91ff23cb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "female\t→ Int64Index([ 2, 3, 4, 9, 10, 11, 12, 15, 16, 19,\n", " ...\n", " 867, 872, 875, 876, 880, 881, 883, 886, 888, 889],\n", " dtype='int64', name='PassengerId', length=314)\n", "male\t→ Int64Index([ 1, 5, 6, 7, 8, 13, 14, 17, 18, 21,\n", " ...\n", " 874, 877, 878, 879, 882, 884, 885, 887, 890, 891],\n", " dtype='int64', name='PassengerId', length=577)\n" ] } ], "source": [ "# df_by_sex.groups est un dictionnaire\n", "# et voici ses clés et valeurs \n", "for k, v in df_by_sex.groups.items():\n", " print(f\"{k}\\t→ {v}\")" ] }, { "cell_type": "markdown", "id": "a9ad056d", "metadata": {}, "source": [ "## groupement multi-critères" ] }, { "cell_type": "markdown", "id": "c5b34a50", "metadata": {}, "source": [ "Si maintenant on s'intéresse à plusieurs colonnes ? Comment est-ce que ça pourrait se présenter à votre avis ?\n", "\n", "La solution adoptée, c'est de passer à `groupby`, non plus une seule colonne mais .. une liste de colonnes. \n", "\n", "Le fonctionnement de `groupby` dans ce cas consiste à \n", "\n", "* calculer pour chaque colonne les valeurs distinctes (comme dans le cas simple)\n", "* et en faire le **produit cartésien** pour obtenir les clés du groupement (incidemment, sous la forme de tuples)\n", "\n", "Ainsi dans notre exemple si nous prenons les critères : `Pclass` et`Sex`:\n", "\n", "* le premier critère donne trois valeurs `1`, `2` et `3` pour les trois classes de navigation\n", "* le second donne 2 valeurs `female` et `male`\n", "\n", "et donc on va avoir 6 tuples qui serviront de clés pour le groupement : (1, 'female'), (1, 'male'), (2, 'female')..." ] }, { "cell_type": "code", "execution_count": 7, "id": "70b588be", "metadata": {}, "outputs": [], "source": [ "df_by_sex_class = df.groupby(['Pclass', 'Sex'])" ] }, { "cell_type": "markdown", "id": "88078028", "metadata": {}, "source": [ "### les tailles des groupes" ] }, { "cell_type": "markdown", "id": "3318f9f8", "metadata": {}, "source": [ "Pour faire une synthèse on peut utiliser `size()` pour récapituler les groupes; la présentation nous montre bien le produit cartésien qui a été fait" ] }, { "cell_type": "code", "execution_count": 8, "id": "28a5d990", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass Sex \n", "1 female 94\n", " male 122\n", "2 female 76\n", " male 108\n", "3 female 144\n", " male 347\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_by_sex_class.size()" ] }, { "cell_type": "markdown", "id": "73172a19", "metadata": {}, "source": [ "En une seule ligne:" ] }, { "cell_type": "code", "execution_count": 9, "id": "b8e3fae4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass Sex \n", "1 female 94\n", " male 122\n", "2 female 76\n", " male 108\n", "3 female 144\n", " male 347\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['Pclass', 'Sex']).size()" ] }, { "cell_type": "markdown", "id": "e92466bd", "metadata": {}, "source": [ "### accéder aux sous-dataframes" ] }, { "cell_type": "markdown", "id": "26c767fa", "metadata": {}, "source": [ "De même nous pouvons itérer sur les sous-dataframes." ] }, { "cell_type": "code", "execution_count": 10, "id": "0981965e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "94 passagers en classe '1' de genre 'female'\n", "122 passagers en classe '1' de genre 'male'\n", "76 passagers en classe '2' de genre 'female'\n", "108 passagers en classe '2' de genre 'male'\n", "144 passagers en classe '3' de genre 'female'\n", "347 passagers en classe '3' de genre 'male'\n" ] } ], "source": [ "# pour itérer sur les 6 catégories\n", "for name, dataframe in df_by_sex_class:\n", " print(f\"{len(dataframe)} passagers en classe '{name[0]}' de genre '{name[1]}'\")" ] }, { "cell_type": "markdown", "id": "b08ec883", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Pour les curieux, une petite astuce, utile à ce stade; on pourrait avoir envie d'utiliser la méthode `.head()` pour afficher chacune des sous-dataframes, en écrivant ceci" ] }, { "cell_type": "code", "execution_count": 11, "id": "2dbc4ca3", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [], "source": [ "# malheureusement ceci ne marche pas !!\n", "for name, dataframe in df_by_sex_class:\n", " dataframe.head(1)" ] }, { "cell_type": "markdown", "id": "29648c38", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "En fait ce qui se passe ici, c'est que la méthode `.head()` renvoie un objet que le notebook sait afficher; donc quand on écrit une cellule qui ne contient que (ou dont la dernière instruction est) `df.head()`, cet objet se fait afficher parce que c'est le résultat de la cellule (comme quand `40` se fait affichez quand vous évaluez une cellule avec juste `10+30`)\n", "\n", "Mais dans le cas du `for` qu'on est en train de vouloir écrire, le résultat du for est `None`, et rien ne se fait afficher; il nous faut donc provoquer l'affichage (un peu comme quand on est obligé d'insérer un `print()` au milieu d'une cellule); voici l'astuce pour arriver à provoquer l'affichage souhaité :" ] }, { "cell_type": "code", "execution_count": 12, "id": "52cf3af3", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age SibSp \\\n", "PassengerId \n", "7 0 1 McCarthy, Mr. Timothy J male 54.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "7 0 17463 51.8625 E46 S " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
1012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex \\\n", "PassengerId \n", "10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "10 14.0 1 0 237736 30.0708 NaN C " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
1812Williams, Mr. Charles EugenemaleNaN0024437313.0NaNS
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age SibSp \\\n", "PassengerId \n", "18 1 2 Williams, Mr. Charles Eugene male NaN 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "18 0 244373 13.0 NaN S " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.925NaNS
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age SibSp \\\n", "PassengerId \n", "3 1 3 Heikkinen, Miss. Laina female 26.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "3 0 STON/O2. 3101282 7.925 NaN S " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund, Mr. Owen Harrismale22.010A/5 211717.25NaNS
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age SibSp \\\n", "PassengerId \n", "1 0 3 Braund, Mr. Owen Harris male 22.0 1 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 0 A/5 21171 7.25 NaN S " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# pour que ça fonctionne il faut forcer l'affichage \n", "# en utilisant display() qui se trouve dans le module IPython\n", "import IPython \n", "for name, dataframe in df_by_sex_class:\n", " IPython.display.display(dataframe.head(1))" ] }, { "cell_type": "markdown", "id": "76154ce6", "metadata": {}, "source": [ "À vous de jouer : calculer le `groupby` avec le genre, la classe et si la personne a survécu ou non. Dans quel groupe de personnes reste-il le plus de survivants ? Et le moins ?" ] }, { "cell_type": "code", "execution_count": 13, "id": "2140ad63", "metadata": {}, "outputs": [], "source": [ "# votre code ici (une petite idée de correction ci-dessous)" ] }, { "cell_type": "code", "execution_count": 14, "id": "6efd3898", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass Sex Survived\n", "1 female 0 3\n", " 1 91\n", " male 0 77\n", " 1 45\n", "2 female 0 6\n", " 1 70\n", " male 0 91\n", " 1 17\n", "3 female 0 72\n", " 1 72\n", " male 0 300\n", " 1 47\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_by_sex_class_survived = df.groupby(['Pclass', 'Sex', 'Survived'])\n", "df_by_sex_class_survived.size()" ] }, { "cell_type": "markdown", "id": "5d8822ae", "metadata": {}, "source": [ "### les groupes (dictionnaire d'index par clés)" ] }, { "cell_type": "markdown", "id": "aa3c1ca8", "metadata": {}, "source": [ "Lister les clés\n", "\n", "Les clés sont des tuples de valeurs." ] }, { "cell_type": "code", "execution_count": 15, "id": "03da8298", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys([(1, 'female', 0), (1, 'female', 1), (1, 'male', 0), (1, 'male', 1), (2, 'female', 0), (2, 'female', 1), (2, 'male', 0), (2, 'male', 1), (3, 'female', 0), (3, 'female', 1), (3, 'male', 0), (3, 'male', 1)])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_by_sex_class_survived.groups.keys()" ] }, { "cell_type": "markdown", "id": "22b80c18", "metadata": {}, "source": [ "Les valeurs sont des listes d'index, ça me permet de retrouver les entrées dans la dataframe d'origine." ] }, { "cell_type": "markdown", "id": "1af449c6", "metadata": {}, "source": [ "Par exemple, si nous voulons accéder aux trois seules femmes de première classe qui n'ont pas survécu. La clé est `(1, 'female', 0)`.\n", "\n", "Nous allons, cette fois, les rechercher dans la grande dataframe. Remarquez que bien sûr ici on utilise `loc` puisque nous sommes uniquement dans l'espace des index." ] }, { "cell_type": "code", "execution_count": 16, "id": "4f53aea3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
17801Isham, Miss. Ann Elizabethfemale50.000PC 1759528.7125C49C
29801Allison, Miss. Helen Lorainefemale2.012113781151.5500C22 C26S
49901Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female25.012113781151.5500C22 C26S
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "178 0 1 \n", "298 0 1 \n", "499 0 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "178 Isham, Miss. Ann Elizabeth female 50.0 \n", "298 Allison, Miss. Helen Loraine female 2.0 \n", "499 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "178 0 0 PC 17595 28.7125 C49 C \n", "298 1 2 113781 151.5500 C22 C26 S \n", "499 1 2 113781 151.5500 C22 C26 S " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df_by_sex_class_survived.groups[(1, 'female', 0)]]" ] }, { "cell_type": "markdown", "id": "54e163f9", "metadata": {}, "source": [ "## découper des intervalles de valeurs dans une colonne" ] }, { "cell_type": "markdown", "id": "ac6475f1", "metadata": {}, "source": [ "Parfois, nous sommes intéressés, non pas par les différentes valeurs d'une colonne (qui seraient trop nombreuses) mais par des intervalles de ces valeurs.\n", "\n", "Prenons par exemple la colonne des âges. Si nous faisons un groupement brutalement sur la colonne `Age`, nous obtenons 88 âges différents, ce qui ne nous apporte pas une information intéressante.\n", "\n", "Par contre ça devient plus intéressant si on raisonne par **classes** d'âges, par exemple les enfants, jeunes, adultes et les plus de 55 ans.\n", " - *'enfant'* disons entre 0 et 12 ans\n", " - *'jeune'* disons entre 12 et 19 ans\n", " - *'adulte'* disons entre 19 et 55 ans\n", " - *'+55'* et les personnes de plus de 55 ans\n", " \n", "Nous aimerions donc avoir une colonne dans notre dataframe avec ces labels pour compléter les âges.\n", " \n", "La fonction `pandas.cut`, appliquée à une colonne de votre dataframe, va vous générer une telle colonne, et vous pouvez donner des labels aux intervalles:" ] }, { "cell_type": "markdown", "id": "890254bb", "metadata": {}, "source": [ "Nous allons rajouter la colonne à la dataframe. Sachant que les colonnes d'une dataframe sont les clés d'un dictionnaire, pour ajouter une colonne à votre dataframe, il faut faire comme pour les `dict` en Python." ] }, { "cell_type": "code", "execution_count": 17, "id": "513e004f", "metadata": {}, "outputs": [], "source": [ "# 'bin' en anglais signifie corbeille ou panier\n", "# c'est comme si on mettait les gens dans 4 paniers\n", "\n", "df['age-periode'] = pd.cut(df['Age'], bins=[0, 12, 19, 55, 100])" ] }, { "cell_type": "markdown", "id": "2c0d2f2e", "metadata": {}, "source": [ "Je montre les 6 premières lignes des 3 dernières colonnes de la dataframe:" ] }, { "cell_type": "code", "execution_count": 18, "id": "34dc43a1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CabinEmbarkedage-periode
PassengerId
1NaNS(19.0, 55.0]
2C85C(19.0, 55.0]
3NaNS(19.0, 55.0]
4C123S(19.0, 55.0]
5NaNS(19.0, 55.0]
6NaNQNaN
\n", "
" ], "text/plain": [ " Cabin Embarked age-periode\n", "PassengerId \n", "1 NaN S (19.0, 55.0]\n", "2 C85 C (19.0, 55.0]\n", "3 NaN S (19.0, 55.0]\n", "4 C123 S (19.0, 55.0]\n", "5 NaN S (19.0, 55.0]\n", "6 NaN Q NaN" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# on a rajouté une colonne \n", "df[df.columns[-3:]].head(6)" ] }, { "cell_type": "markdown", "id": "7aa0569e", "metadata": {}, "source": [ "Je donne des noms aux périodes d'âge (ici on va rajouter encore une colonne)" ] }, { "cell_type": "code", "execution_count": 19, "id": "430fe25f", "metadata": {}, "outputs": [], "source": [ "df['name-age-periode'] = pd.cut(df['Age'], bins=[0, 12, 19, 55, 100], \n", " labels=['children', ' young', 'adult', 'old'])" ] }, { "cell_type": "code", "execution_count": 20, "id": "b14c8b18", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Embarkedage-periodename-age-periode
PassengerId
1S(19.0, 55.0]adult
2C(19.0, 55.0]adult
3S(19.0, 55.0]adult
4S(19.0, 55.0]adult
5S(19.0, 55.0]adult
6QNaNNaN
\n", "
" ], "text/plain": [ " Embarked age-periode name-age-periode\n", "PassengerId \n", "1 S (19.0, 55.0] adult\n", "2 C (19.0, 55.0] adult\n", "3 S (19.0, 55.0] adult\n", "4 S (19.0, 55.0] adult\n", "5 S (19.0, 55.0] adult\n", "6 Q NaN NaN" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.columns[-3:]].head(6)" ] }, { "cell_type": "markdown", "id": "7b286fd2", "metadata": {}, "source": [ "Et maintenant nous pouvons utiliser cette colonne dans des `groupby`" ] }, { "cell_type": "code", "execution_count": 21, "id": "477aa2e7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name-age-periode\n", "children 69\n", " young 95\n", "adult 510\n", "old 40\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['name-age-periode']).size()" ] }, { "cell_type": "code", "execution_count": 22, "id": "caf8f0e3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name-age-periode Survived\n", "children 0 29\n", " 1 40\n", " young 0 56\n", " 1 39\n", "adult 0 311\n", " 1 199\n", "old 0 28\n", " 1 12\n", "dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# etc...\n", "df.groupby(['name-age-periode', 'Survived']).size()" ] }, { "cell_type": "code", "execution_count": 23, "id": "51fd2126", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass Sex Survived\n", "1 female 0 3\n", " 1 91\n", " male 0 77\n", " 1 45\n", "2 female 0 6\n", " 1 70\n", " male 0 91\n", " 1 17\n", "3 female 0 72\n", " 1 72\n", " male 0 300\n", " 1 47\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# etc..\n", "df.groupby(['Pclass', 'Sex', 'Survived', ]).size()" ] }, { "cell_type": "markdown", "id": "6d584cff", "metadata": {}, "source": [ "Vous avez désormais une petite idée de l'utilisation de la fonction `groupby` pour des recherches multi-critères sur une table de données." ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "all,-hidden,-heading_collapsed", "cell_metadata_json": true, "notebook_metadata_filter": "all,-language_info,-toc,-jupytext.text_representation.jupytext_version,-jupytext.text_representation.format_version", "text_representation": { "extension": ".md", "format_name": "myst" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" }, "notebookname": "regroupements", "source_map": [ 15, 24, 32, 36, 41, 45, 49, 53, 59, 61, 65, 69, 73, 77, 79, 83, 87, 90, 94, 98, 104, 109, 113, 131, 133, 137, 141, 143, 147, 149, 153, 157, 163, 167, 175, 181, 189, 193, 197, 200, 204, 210, 212, 216, 222, 224, 228, 244, 248, 253, 257, 260, 264, 269, 271, 275, 279, 284, 287 ], "version": "1.0" }, "nbformat": 4, "nbformat_minor": 5 }