{ "cells": [ { "cell_type": "markdown", "id": "a8c0f52e", "metadata": { "tags": [] }, "source": [ "# modifying and *slicing* dataframes\n", "\n", "Where we learn how to slice and modify parts of a dataframe" ] }, { "cell_type": "markdown", "id": "f011bf85", "metadata": {}, "source": [ "In this notebook we look at how to slice `pandas` objects such as series and dataframes, and how to manipulate the pieces. This is something you will do often with your tables: apply an operation to a subset of your data." ] }, { "cell_type": "markdown", "id": "a39d85b8", "metadata": {}, "source": [ "Let's import our libraries, then read a table of Titanic passengers to use as an example." ] }, { "cell_type": "code", "execution_count": 1, "id": "17fb9a6a", "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "09d90019", "metadata": {}, "source": [ "Let's read our Titanic dataframe, using the `PassengerId` column as the row index." ] }, { "cell_type": "code", "execution_count": 2, "id": "7733d3f1", "metadata": {}, "outputs": [], "source": [ "file = 'titanic.csv'\n", "df = pd.read_csv(file, index_col='PassengerId')" ] }, { "cell_type": "markdown", "id": "2912a031", "metadata": {}, "source": [ "As in the previous notebook, we also sort it by age, so that the difference between index labels and integer positions is easy to see." ] }, { "cell_type": "code", "execution_count": 3, "id": "7d6df860", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 3 Baclini, Miss. Eugenie female 0.75 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C \n", "756 1 1 250649 14.5000 NaN S \n", "645 2 1 2666 19.2583 NaN C " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by='Age', inplace=True)\n", "df.head(3)" ] }, { "cell_type": "markdown", "id": "5600e7c1", "metadata": {}, "source": [ "## copying a dataframe" ] }, { "cell_type": "markdown", "id": "14f61801", "metadata": {}, "source": [ "One useful thing to know is how to copy a dataframe. For that, use the `copy` method of `pandas.DataFrame`." ] }, { "cell_type": "code", "execution_count": 4, "id": "c1642aff", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 3 Baclini, Miss. Eugenie female 0.75 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C \n", "756 1 1 250649 14.5000 NaN S \n", "645 2 1 2666 19.2583 NaN C " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_copy = df.copy()\n", "df_copy.head(3) # df_copy is a new dataframe, a twin of the original" ] }, { "cell_type": "markdown", "id": "ca7aea3f", "metadata": {}, "source": [ "`df_copy` is now a new dataframe with the same values as the original, but fully independent of it." ] }, { "cell_type": "markdown", "id": "6b6a182b", "metadata": {}, "source": [ "## creating a new column" ] }, { "cell_type": "markdown", "id": "87bdd08d", "metadata": {}, "source": [ "It is often convenient to create a new column by computing it from existing columns. \n", "In practice, column operations are the only ones that use the `df[column_name]` form." ] }, { "cell_type": "code", "execution_count": 5, "id": "cdf9cb65", "metadata": {}, "outputs": [], "source": [ "# to create a new column\n", "# for example, here I add a 'Deceased' column\n", "# which is simply the complement of 'Survived'\n", "\n", "df['Deceased'] = 1 - df['Survived']" ] }, { "cell_type": "code", "execution_count": 6, "id": "5bfce3ba", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 3 Baclini, Miss. Eugenie female 0.75 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "756 1 1 250649 14.5000 NaN S 0 \n", "645 2 1 2666 19.2583 NaN C 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "id": "bf6e7b5d", "metadata": {}, "source": [ "## accessing and modifying parts of a dataframe: some context" ] }, { "cell_type": "markdown", "id": "cd738f24", "metadata": {}, "source": [ "To access or modify parts of a dataframe, you might be tempted to use the usual syntax for accessing array elements by position, as you would in Python.\n", "\n", "For example, with a Python list:" ] }, { "cell_type": "code", "execution_count": 7, "id": "7cefdf70", "metadata": { "cell_style": "split" }, "outputs": [ { "data": { "text/plain": [ "['Hello !', 56, 34]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L = [-12, 56, 34]\n", "L[0] = \"Hello !\"\n", "L" ] }, { "cell_type": "code", "execution_count": 8, "id": "c800a0e4", "metadata": { "cell_style": "split" }, "outputs": [ { "data": { "text/plain": [ "['Hello !', 100, 200, 300, 34]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L[1:2] = [100, 200, 300]\n", "L" ] }, { "cell_type": "markdown", "id": "49c9edfc", "metadata": {}, "source": [ "Or you might want to address an array with a pair of **integer indices**, as you would in `numpy`:" ] }, { "cell_type": "code", "execution_count": 9, "id": "0e61df7d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[10000 1 2 3 10000]\n", " [ 5 6 7 8 9]\n", " [ 
10 11 100 13 14]\n", " [ 15 16 17 18 19]\n", " [10000 21 22 23 10000]]\n" ] }, { "data": { "text/plain": [ "array([10000, 1, 2, 3, 10000])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mat = np.arange(25).reshape((5, 5)) # build the 5x5 matrix holding 0 to 24\n", "mat[2, 2] = 100 # modify the element in the middle\n", "mat[::4, ::4] = 10000 # modify the 4 corners (::4 = from start to end with a step of 4)\n", "print(mat) # print the matrix\n", "mat[0] # access its first row" ] }, { "cell_type": "markdown", "id": "9b0f676b", "metadata": {}, "source": [ "In `pandas`, however, things are quite different: as we have already seen, the library puts its effort into managing an index on rows and on columns.\n", "\n", "It favours locating the elements of a dataframe **by index** (column **names** and row **labels**), and **not** by integer **positions** as in Python and `numpy`.\n", "\n", "Why? Because if you are using `pandas`, you need to see your data as a table with labels indexing the rows and the columns. If you do not need any particular index, then you are comfortable handling your data with integer positions alone, and in that case you may as well use a plain `numpy` array: there is no point storing a matrix in a dataframe, since `numpy` and its (row, column) indices are all you need!" ] }, { "cell_type": "markdown", "id": "d37fcfb9", "metadata": {}, "source": [ "Nevertheless, `pandas` offers fairly similar, and equally powerful, techniques, which we study in this notebook." 
] }, { "cell_type": "markdown", "id": "8fb3f6ec", "metadata": {}, "source": [ "## reminder: `loc` for atomic accesses" ] }, { "cell_type": "markdown", "id": "91f33f01", "metadata": {}, "source": [ "As we saw in the previous notebook, accessing a pandas dataframe is done \n", "\n", "* most often by index labels rather than integer positions\n", "* and in that case we use `df.loc` to reach rows and cells" ] }, { "cell_type": "code", "execution_count": 10, "id": "65a3ac06", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 3 Baclini, Miss. Eugenie female 0.75 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "756 1 1 250649 14.5000 NaN S 0 \n", "645 2 1 2666 19.2583 NaN C 0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "code", "execution_count": 11, "id": "3bb2c191", "metadata": { "cell_style": "split" }, "outputs": [ { "data": { "text/plain": [ "'Hamalainen, Master. Viljo'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# with loc, it is row, then column\n", "# and with index labels (not positions)\n", "df.loc[756, 'Name']" ] }, { "cell_type": "code", "execution_count": 12, "id": "5c199168", "metadata": { "cell_style": "split" }, "outputs": [], "source": [ "# to upgrade a passenger\n", "df.loc[645, 'Pclass'] -= 1" ] }, { "cell_type": "code", "execution_count": 13, "id": "4d252626", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 2 Baclini, Miss. Eugenie female 0.75 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "756 1 1 250649 14.5000 NaN S 0 \n", "645 2 1 2666 19.2583 NaN C 0 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "id": "ece36893", "metadata": {}, "source": [ "## slicing" ] }, { "cell_type": "markdown", "id": "a6cf0f41", "metadata": {}, "source": [ "### `df.loc` and **inclusive bounds**" ] }, { "cell_type": "markdown", "id": "892fd162", "metadata": {}, "source": [ "The first thing we may want to do, then, is to access the dataframe through *slices*. This should feel routine by now: every time we meet a data structure that is used with `[]`, the operation ends up being extended to slices.\n", "\n", "Remember that a Python slice has the form `start:stop:step`, and that any of the parts can be omitted: for example `:` is a slice covering the whole range, and `::-1` reverses the order. Refer back to the relevant chapters if this is no longer clear.\n", "\n", "**However**, one **difference** must be stressed right away: **when slicing by index labels**, dataframe slices **include both bounds**. As you will recall, this has never been the case so far with Python or numpy slices, where the upper bound is always excluded. Let's see this in practice." ] }, { "cell_type": "code", "execution_count": 14, "id": "669bf021", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 2 Baclini, Miss. Eugenie female 0.75 \n", "470 1 3 Baclini, Miss. Helene Barbara female 0.75 \n", "79 1 2 Caldwell, Master. Alden Gates male 0.83 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "756 1 1 250649 14.5000 NaN S 0 \n", "645 2 1 2666 19.2583 NaN C 0 \n", "470 2 1 2666 19.2583 NaN C 0 \n", "79 0 2 248738 29.0000 NaN S 0 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "cell_type": "code", "execution_count": 15, "id": "997a134c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 2 Baclini, Miss. Eugenie female 0.75 \n", "470 1 3 Baclini, Miss. Helene Barbara female 0.75 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "756 1 1 250649 14.5000 NaN S 0 \n", "645 2 1 2666 19.2583 NaN C 0 \n", "470 2 1 2666 19.2583 NaN C 0 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# I select the rows between \n", "# index 756 and index 470, INCLUSIVE\n", "\n", "df.loc[756:470]" ] }, { "cell_type": "markdown", "id": "161f9f7b", "metadata": {}, "source": [ "There is some logic to this: index labels are in no particular order a priori (and may be names rather than integers), but it remains unsettling at first. This will not be the case with `iloc`, which works on integer positions." ] }, { "cell_type": "markdown", "id": "8cbadf0a", "metadata": {}, "source": [ "### `df.loc` with slicing on columns" ] }, { "cell_type": "markdown", "id": "4c12cbfe", "metadata": {}, "source": [ "Let's see how to slice in the other direction" ] }, { "cell_type": "code", "execution_count": 16, "id": "996ae998", "metadata": { "cell_style": "split" }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "804 3\n", "756 2\n", "645 2\n", "470 3\n", "79 2\n", " ..\n", "860 3\n", "864 3\n", "869 3\n", "879 3\n", "889 3\n", "Name: Pclass, Length: 891, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# if I write this, I designate \n", "# all the rows of the column \n", "# hence the whole Pclass column\n", "\n", "df.loc[:, 'Pclass']" ] }, { "cell_type": "code", "execution_count": 17, "id": "8d1737ff", "metadata": { "cell_style": "split", "tags": [ "level_advanced" ] }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# and indeed, in the pandas version used here,\n", "# both expressions return the very same object in memory!\n", "# (this depends on the pandas version; do not rely on it)\n", "\n", "df.loc[:, 'Pclass'] is df['Pclass']" ] }, { "cell_type": "markdown", "id": "4cad311a", "metadata": {}, "source": [ "So logically, if I want to select a range of columns, I use two slices:\n", "\n", "* along the rows, we take everything with a plain `:` slice\n", "* along the columns, slicing also works **inclusively**" ] }, { "cell_type": "code", "execution_count": 18, "id": "fafac5a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Sex Age SibSp Parch\n", "PassengerId \n", "804 male 0.42 0 1\n", "756 male 0.67 1 1\n", "645 female 0.75 2 1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# here, as for rows, we are working with index labels\n", "# and not positions, so the slice bounds are INCLUSIVE\n", "\n", "df.loc[:, 'Sex':'Parch'].head(3)" ] }, { "cell_type": "markdown", "id": "295952e1", "metadata": {}, "source": [ "### `df.loc` for writing: **inclusive bounds**" ] }, { "cell_type": "markdown", "id": "b2202658", "metadata": {}, "source": [ "A dataframe can perfectly well be modified through slices, still using `df.loc`, and still with inclusive bounds of course:" ] }, { "cell_type": "code", "execution_count": 19, "id": "6f2c92b9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 2 Baclini, Miss. Eugenie female 0.75 \n", "470 1 3 Baclini, Miss. Helene Barbara female 0.75 \n", "79 1 2 Caldwell, Master. Alden Gates male 0.83 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "756 1 1 250649 14.5000 NaN S 0 \n", "645 2 1 2666 19.2583 NaN C 0 \n", "470 2 1 2666 19.2583 NaN C 0 \n", "79 0 2 248738 29.0000 NaN S 0 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "cell_type": "code", "execution_count": 20, "id": "e063d49e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 2 Baclini, Miss. Eugenie female 0.75 \n", "470 1 3 Baclini, Miss. Helene Barbara female 0.75 \n", "79 1 2 Caldwell, Master. Alden Gates male 0.83 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "756 1000 1000 250649 14.5000 NaN S 0 \n", "645 2000 1000 2666 19.2583 NaN C 0 \n", "470 2000 1000 2666 19.2583 NaN C 0 \n", "79 0 2 248738 29.0000 NaN S 0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# without looking for a particularly useful \"use case\",\n", "# let's multiply a portion of the dataframe by 1000\n", "\n", "# the rows from 756 to 470 inclusive\n", "# the columns from SibSp to Parch inclusive\n", "\n", "# writing x *= 1000\n", "# means x = x * 1000\n", "\n", "df.loc[756:470, 'SibSp':'Parch'] *= 1000\n", "\n", "# check\n", "df.head(5)" ] }, { "cell_type": "markdown", "id": "260988b2", "metadata": {}, "source": [ "### generalized slicing" ] }, { "cell_type": "markdown", "id": "9a4ac126", "metadata": {}, "source": [ "Of course, all the features we already know can be combined to write arbitrarily complicated selections. These are rarely useful; the point is simply to show that the whole logic carries over." ] }, { "cell_type": "code", "execution_count": 21, "id": "3bfa130b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 2 Baclini, Miss. Eugenie female 0.75 \n", "470 1 3 Baclini, Miss. Helene Barbara female 0.75 \n", "79 1 2 Caldwell, Master. Alden Gates male 0.83 \n", "832 1 2 Richards, Master. George Sibley male 0.83 \n", "306 1 1 Allison, Master. Hudson Trevor male 0.92 \n", "828 1 2 Mallet, Master. Andre male 1.00 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \\\n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C \n", "756 1000 1000 250649 14.5000 NaN S \n", "645 2000 1000 2666 19.2583 NaN C \n", "470 2000 1000 2666 19.2583 NaN C \n", "79 0 2 248738 29.0000 NaN S \n", "832 1 1 29106 18.7500 NaN S \n", "306 1 2 113781 151.5500 C22 C26 S \n", "828 0 2 S.C./PARIS 2079 37.0042 NaN C \n", "\n", " Deceased \n", "PassengerId \n", "804 0 \n", "756 0 \n", "645 0 \n", "470 0 \n", "79 0 \n", "832 0 \n", "306 0 \n", "828 0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(8)" ] }, { "cell_type": "code", "execution_count": 22, "id": "1428fa5e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " Sex SibSp Ticket\n", "PassengerId \n", "804 male 0 2625\n", "645 female 2000 2666\n", "79 male 0 248738\n", "306 male 1 113781" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# everything we have learned so far about slices\n", "# works as expected, apart from this business of \n", "# the upper bound being inclusive with index labels\n", "\n", "df.loc[804:828:2, 'Sex':'Ticket':2]" ] }, { "cell_type": "markdown", "id": "2c5914e7", "metadata": {}, "source": [ "### *copied or not copied, that is the question*" ] }, { "cell_type": "markdown", "id": "5757c29f", "metadata": {}, "source": [ "To close this section, for the curious: a sometimes thorny question arises when selecting parts of a dataframe.\n", "\n", "When an operation on a `pandas` dataframe returns a part of that dataframe, whether the selection is **a shared reference** to the original data or **a copy** of it ... depends on the context!\n", "\n", "Fine, you may say, but why should I care? Let the library manage its sub-arrays as it sees fit, I won't worry about it ...\n", "\n", "And that is true ... until you start modifying parts of a dataframe ...\n", "\n", " - if the selection is a **copy** of that part of the dataframe, your modification will **not be reflected** in the original dataframe, obviously…\n", " \n", " - and if it is a shared reference to a part of the original dataframe, then your modifications to the selection do propagate to the original data.\n", " \n", "Ahh ... 
vous commencez à comprendre: savoir si une opération retourne une copie ou une référence devient important mais dépend du contexte.\n", "\n", "Ce qu'il faut retenir c'est que\n", "\n", "* en utilisant la forme `df.loc[line, column]` on ne crée pas de copie, c'est la bonne façon d'utiliser `loc`\n", "* par contre les formes qui utilisent un *chained indexing* - que ce soit `df[l][c]` ou `df.loc[l][c]`, on n'est plus du tout sûr du résultat : il ne faut pas les utiliser pour modifier quoi que ce soit !!" ] }, { "cell_type": "markdown", "id": "dc7e3fdd", "metadata": {}, "source": [ "## autres mécanismes d'indexation" ] }, { "cell_type": "markdown", "id": "4e14ad85", "metadata": {}, "source": [ "### accès à une liste explicite de lignes ou colonnes" ] }, { "cell_type": "markdown", "id": "011ee151", "metadata": {}, "source": [ "Nous voulons maintenant prendre une référence sur une sous-partie d'une dataframe qui **ne s'exprime pas sous la forme d'une slice (tranche)**, mais par contre nous possédons la liste des (index des) lignes et des colonnes que nous souhaitons conserver dans ma sous-partie de dataframe.\n", "\n", "`pandas` sait parfaitement le faire :\n", "\n", "* on utilise `df.loc[]` puisqu'on va désigner des index,\n", "* et on va passer dans les `[]`, non plus des slices, mais tout simplement des listes (et de plus, vous donnez les index dans l'ordre qui vous intéresse) :" ] }, { "cell_type": "markdown", "id": "c388926f", "metadata": {}, "source": [ "Prenons ainsi par exemple \n", "\n", "* les lignes d'index 450, 3, 67, 800 et 678\n", "* et les colonnes `Age`, `Pclass` et `Survived`\n", "\n", "Et comme ce sont des index, nous utilisons `loc`." ] }, { "cell_type": "code", "execution_count": 23, "id": "1fa46021", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Age Pclass Survived\n", "PassengerId \n", "450 52.0 1 1\n", "3 26.0 3 1\n", "67 29.0 2 1\n", "800 30.0 3 0\n", "678 18.0 3 1" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# selecting specific rows and columns is easy\n", "df.loc[[450, 3, 67, 800, 678], ['Age', 'Pclass', 'Survived']]" ] }, { "cell_type": "markdown", "id": "4880bc46", "metadata": {}, "source": [ "### selection with a boolean expression" ] }, { "cell_type": "markdown", "id": "d52e7d9d", "metadata": {}, "source": [ "We saw in the previous notebook that we can test every value of a column at once, and that this yields an array of booleans." ] }, { "cell_type": "code", "execution_count": 24, "id": "944cbe7f", "metadata": { "cell_style": "center" }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this expression returns a Series\n", "mask = df['Pclass'] >= 3\n", "type(mask)" ] }, { "cell_type": "code", "execution_count": 25, "id": "cab3d2aa", "metadata": { "cell_style": "center" }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "804 True\n", "756 False\n", "645 False\n", "470 True\n", "79 False\n", "Name: Pclass, dtype: bool" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's see what it contains\n", "mask.head() # a boolean mask aligned on the row index, i.e. on PassengerId" ] }, { "cell_type": "markdown", "id": "46d22ac4", "metadata": {}, "source": [ "The last way to access sub-parts of a dataframe is to **index** it with a **boolean mask** aligned on the row index, i.e. we keep only the rows of the dataframe where the boolean is true.\n", "\n", "For example, to extract the rows for third-class passengers, we use `mask` (an object of type `Series` that contains booleans) to index the dataframe." ] }, { "cell_type": "code", "execution_count": 26, "id": "ca3df2e1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "

490 rows × 12 columns

\n", "
" ], "text/plain": [ " Survived Pclass Name \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander \n", "470 1 3 Baclini, Miss. Helene Barbara \n", "382 1 3 Nakid, Miss. Maria (\"Mary\") \n", "165 0 3 Panula, Master. Eino Viljami \n", "387 0 3 Goodwin, Master. Sidney Leonard \n", "... ... ... ... \n", "860 0 3 Razi, Mr. Raihed \n", "864 0 3 Sage, Miss. Dorothy Edith \"Dolly\" \n", "869 0 3 van Melkebeke, Mr. Philemon \n", "879 0 3 Laleff, Mr. Kristo \n", "889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \\\n", "PassengerId \n", "804 male 0.42 0 1 2625 8.5167 NaN C \n", "470 female 0.75 2000 1000 2666 19.2583 NaN C \n", "382 female 1.00 0 2 2653 15.7417 NaN C \n", "165 male 1.00 4 1 3101295 39.6875 NaN S \n", "387 male 1.00 5 2 CA 2144 46.9000 NaN S \n", "... ... ... ... ... ... ... ... ... \n", "860 male NaN 0 0 2629 7.2292 NaN C \n", "864 female NaN 8 2 CA. 2343 69.5500 NaN S \n", "869 male NaN 0 0 345777 9.5000 NaN S \n", "879 male NaN 0 0 349217 7.8958 NaN S \n", "889 female NaN 1 2 W./C. 6607 23.4500 NaN S \n", "\n", " Deceased \n", "PassengerId \n", "804 0 \n", "470 0 \n", "382 0 \n", "165 1 \n", "387 1 \n", "... ... \n", "860 1 \n", "864 1 \n", "869 1 \n", "879 1 \n", "889 1 \n", "\n", "[490 rows x 12 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# note that inside the brackets we no longer have\n", "# a slice or a list,\n", "# but a column (a Series) of booleans,\n", "# which is called a mask\n", "\n", "df.loc[mask]" ] }, { "cell_type": "markdown", "id": "0cb6b247", "metadata": {}, "source": [ "Note that very often we will not bother breaking things down like this, and will write directly" ] }, { "cell_type": "code", "execution_count": 27, "id": "6d2c9a3b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "470 1 3 Baclini, Miss. Helene Barbara female 0.75 \n", "382 1 3 Nakid, Miss. Maria (\"Mary\") female 1.00 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "470 2000 1000 2666 19.2583 NaN C 0 \n", "382 0 2 2653 15.7417 NaN C 0 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# on a single line, this is a little less readable\n", "# but it is a common idiom\n", "\n", "# I add .head(3) to keep the output short\n", "\n", "df[df['Pclass'] >= 3].head(3)" ] }, { "cell_type": "markdown", "id": "481b599c", "metadata": {}, "source": [ "### combining boolean expressions" ] }, { "cell_type": "markdown", "id": "f728dde3", "metadata": {}, "source": [ "A little more sophisticated: we can combine **several conditions**, for example passengers who are not in first class and who are 70 or older.\n", "\n", "But how do we write these conditions ..." ] }, { "cell_type": "code", "execution_count": 28, "id": "bd0eec4b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "That doesn't work, it says 'The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().'\n" ] } ], "source": [ "# one might be tempted to write something like this\n", "\n", "try:\n", " df['Age'] >= 70 and not(df['Pclass'] == 1)\n", "except ValueError as e:\n", " print(f\"That doesn't work, it says '{e}'\")" ] }, { "cell_type": "markdown", "id": "4fbc7dda", "metadata": {}, "source": [ "Does this remind you of something? \n", "We already ran into the same behaviour when writing conditions on `numpy` arrays; \n", "indeed, among the things that can feel counter-intuitive in `numpy` and `pandas` are logical expressions on arrays of booleans.\n", "\n", "You **cannot** use `and`, `or` and `not`!\n", "\n", " - either you use `np.logical_and`, `np.logical_or` and `np.logical_not`, which is not very readable ...\n", " \n", " - or you use `&`, `|` and `~` (the so-called *bitwise* logical operators, applied here element by element), and you parenthesize carefully, because these operators bind more tightly than comparisons such as `>=` and `==`!" ] }, { "cell_type": "code", "execution_count": 29, "id": "49cf56a3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "804 False\n", "756 False\n", "645 False\n", "dtype: bool" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask_age = (df['Age'] >= 70) & (~ (df['Pclass'] == 1)) # a pandas.Series aligned on the index\n", "mask_age.head(3)" ] }, { "cell_type": "code", "execution_count": 30, "id": "e7f86089", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age SibSp \\\n", "PassengerId \n", "673 0 2 Mitchell, Mr. Henry Michael male 70.0 0 \n", "117 0 3 Connors, Mr. Patrick male 70.5 0 \n", "852 0 3 Svensson, Mr. Johan male 74.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "673 0 C.A. 24580 10.500 NaN S 1 \n", "117 0 370369 7.750 NaN Q 1 \n", "852 0 347060 7.775 NaN S 1 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[mask_age]" ] }, { "cell_type": "markdown", "id": "04e02720", "metadata": {}, "source": [ "Or, in the concise form that is usually used:" ] }, { "cell_type": "code", "execution_count": 31, "id": "78476331", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age SibSp \\\n", "PassengerId \n", "673 0 2 Mitchell, Mr. Henry Michael male 70.0 0 \n", "117 0 3 Connors, Mr. Patrick male 70.5 0 \n", "852 0 3 Svensson, Mr. Johan male 74.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "673 0 C.A. 24580 10.500 NaN S 1 \n", "117 0 370369 7.750 NaN Q 1 \n", "852 0 347060 7.775 NaN S 1 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 70 or older, and not in first class\n", "# note that this category is not exactly crowded...\n", "\n", "df.loc[(df['Age'] >= 70) & (~(df['Pclass'] == 1))]" ] }, { "cell_type": "code", "execution_count": 32, "id": "dc1bfc57", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age SibSp \\\n", "PassengerId \n", "673 0 2 Mitchell, Mr. Henry Michael male 70.0 0 \n", "117 0 3 Connors, Mr. Patrick male 70.5 0 \n", "852 0 3 Svensson, Mr. Johan male 74.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "673 0 C.A. 24580 10.500 NaN S 1 \n", "117 0 370369 7.750 NaN Q 1 \n", "852 0 347060 7.775 NaN S 1 " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# same thing with the numpy operators\n", "# personally I prefer the previous version\n", "\n", "df.loc[np.logical_and(df['Age'] >= 70, np.logical_not(df['Pclass'] == 1))] # meh ..." ] }, { "cell_type": "markdown", "id": "a9c39986", "metadata": { "tags": [] }, "source": [ "### summary of indexing methods" ] }, { "cell_type": "markdown", "id": "149dba25", "metadata": {}, "source": [ "To sum up this part, we have seen three indexing methods that can be used with `loc`:\n", "\n", "* you can use a slice, and because we are then handling index labels rather than integer positions, **the bounds are inclusive** (we will see right away that with integer positions the bounds are the usual ones, with the end excluded)\n", "* you can use an explicit list, to choose exactly, and in the order you want, the labels you are interested in\n", "* you can use a mask, that is, a column obtained by applying a boolean expression to the original dataframe; this method is probably best suited to selecting rows" ] }, { "cell_type": "markdown", "id": "0171ff13", "metadata": { "tags": [ "level_advanced" ] }, "source": [ "Note, for the geeks, that you can even mix these three indexing methods; for example, use a list for the rows and a slice for the columns:" ] }, { "cell_type": "code", "execution_count": 33, "id": "307c3ecb", "metadata": { "tags": [ "level_advanced" ] }, "outputs": [], "source": [ "# for example, we can index\n", "# the rows with a list\n", "# the columns with a slice" ] }, { "cell_type": "code", "execution_count": 34, "id": "3f32966c", "metadata": { "tags": [ "level_advanced" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Sex SibSp Ticket Cabin\n", "PassengerId \n", "450 male 0 113786 C104\n", "3 female 0 STON/O2. 3101282 NaN\n", "67 female 0 C.A. 29395 F33\n", "800 female 1 345773 NaN\n", "678 female 0 4138 NaN" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\n", " # along the rows: a list\n", " [450, 3, 67, 800, 678], \n", " # along the columns: a slice\n", " 'Sex':'Cabin':2]" ] }, { "cell_type": "markdown", "id": "4b966a1a", "metadata": {}, "source": [ "## working with integer positions: **usual bounds**" ] }, { "cell_type": "markdown", "id": "961c97db", "metadata": {}, "source": [ "In the (rare) cases where you want to work with integer positions rather than index labels, everything works almost exactly as with labels, except that\n", "\n", "* you must use `iloc` instead of `loc`, of course\n", "* `iloc` supports the same slicing and list-indexing mechanisms we have just seen (a boolean mask, however, must be passed as a plain array, e.g. `mask.to_numpy()`, rather than as a `Series`),\n", "* and since we are now in the space of positions, **slice bounds** behave like the **usual bounds (start included, end excluded)**\n", "\n", "I invite you to check this for yourself by restoring to its original values the portion of the dataframe that we somewhat arbitrarily multiplied by 1000 earlier, all of it using `iloc`" ] }, { "cell_type": "code", "execution_count": 35, "id": "ac59cace", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "PassengerId \n", "804 1 3 Thomas, Master. Assad Alexander male 0.42 \n", "756 1 2 Hamalainen, Master. Viljo male 0.67 \n", "645 1 2 Baclini, Miss. Eugenie female 0.75 \n", "470 1 3 Baclini, Miss. Helene Barbara female 0.75 \n", "79 1 2 Caldwell, Master. Alden Gates male 0.83 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked Deceased \n", "PassengerId \n", "804 0 1 2625 8.5167 NaN C 0 \n", "756 1000 1000 250649 14.5000 NaN S 0 \n", "645 2000 1000 2666 19.2583 NaN C 0 \n", "470 2000 1000 2666 19.2583 NaN C 0 \n", "79 0 2 248738 29.0000 NaN S 0 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# a reminder of where we stand\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": 36, "id": "11904976", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# your mission, should you choose to accept it:\n", "# divide the 6 cells by 1000 again, but using integer positions this time,\n", "# that is, using iloc\n", "..." ] }, { "cell_type": "markdown", "id": "c9b63ab9", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "## the problem of modifying copies (for advanced readers)" ] }, { "cell_type": "markdown", "id": "afb29416", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "On a first reading of this notebook, this section will only make sense to advanced students; the others can come back to it later." ] }, { "cell_type": "markdown", "id": "2180b772", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Let's take a quick look at the problem of *attempting* to modify a copy of a dataframe." 
] }, { "cell_type": "markdown", "id": "2d927a88", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "### modification through chained indexing" ] }, { "cell_type": "markdown", "id": "cd2c6163", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Suppose we access a column, for example the survival column named `Survived`, using the classic syntax for accessing a dictionary key." ] }, { "cell_type": "code", "execution_count": 37, "id": "7c987ccb", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "804 1\n", "756 1\n", "645 1\n", "470 1\n", "79 1\n", " ..\n", "860 0\n", "864 0\n", "869 0\n", "879 0\n", "889 0\n", "Name: Survived, Length: 891, dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Survived']" ] }, { "cell_type": "markdown", "id": "c38f8cc0", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "We get a single column, of type `pandas.Series`, as we already knew." ] }, { "cell_type": "markdown", "id": "29c132fd", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Now that I have a column, nothing prevents me from accessing one of its elements with the plain element-access notation used for Python sequences; let's take the element with index 1." ] }, { "cell_type": "code", "execution_count": 38, "id": "72407296", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# so far, so good\n", "df['Survived'][1]" ] }, { "cell_type": "markdown", "id": "0cde150b", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Now THE question. I have just accessed an element of the `Survived` column; can I use the same access path to modify that element?\n", "\n", "In other words, can I bring the poor passenger with index 1 back to life by setting their survival status to 1 with the assignment `df['Survived'][1] = 1`?\n", "\n", "The answer is no! Why? Because the assignment goes through the intermediate object `df['Survived']`, and `pandas` does not guarantee whether that object is a reference into `df` or a copy of it.\n", "\n", "This is called *chained indexing* (we chain `['Survived']` and `[1]`), and the rule is: *any chained indexing may operate on a copy*, so it must not be used for modifications ...\n", "\n", "You must use `loc` or `iloc`!" ] }, { "cell_type": "markdown", "id": "7a848b8a", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "For advanced readers: this *problem* is known as *chained indexing*; for a fuller explanation see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy (when you have time ...)" ] }, { "cell_type": "markdown", "id": "c188340d", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "### indexing with a list, then modifying" ] }, { "cell_type": "markdown", "id": "757d6149", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Let's index a dataframe with a list of column labels, without using `loc` or `iloc`. In this example we isolate the three columns `Survived`, `Pclass` and `Sex`" ] }, { "cell_type": "code", "execution_count": 39, "id": "d4e4435c", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Survived Pclass Sex\n", "PassengerId \n", "804 1 3 male\n", "756 1 2 male\n", "645 1 2 female\n", "470 1 3 female\n", "79 1 2 male" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = df[ ['Survived', 'Pclass', 'Sex'] ]\n", "df1.head()" ] }, { "cell_type": "markdown", "id": "122e6977", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "We get a dataframe that we call `df1`. Remember that a sub-part obtained by slicing the original dataframe can be one of two things:\n", " - a copy of the dataframe (you must not modify it)\n", " - a reference to the dataframe (you may modify it)." ] }, { "cell_type": "markdown", "id": "ccec6d38", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "So THE question is whether `df1` is a copy of your dataframe or a reference to it.\n", "\n", "It is a copy, so you should not try to modify it, but we are going to do it anyway.\n", "\n", "We try to bring our poor passenger with index 1 back to life by using `loc` on the sub-dataframe `df1` (having forgotten that `df1` is a copy)." ] }, { "cell_type": "markdown", "id": "5ab2723b", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Let's look at the current value of the element we want to modify:" ] }, { "cell_type": "code", "execution_count": 40, "id": "a9bb4555", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.loc[1, 'Survived']" ] }, { "cell_type": "markdown", "id": "ff18f460", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "ok, 0. Let's try to modify it:" ] }, { "cell_type": "code", "execution_count": 41, "id": "84b58ef9", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/share/miniconda/envs/python-numérique/lib/python3.9/site-packages/pandas/core/indexing.py:1817: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self._setitem_single_column(loc, value, pi)\n" ] } ], "source": [ "df1.loc[1, 'Survived'] = 1" ] }, { "cell_type": "markdown", "id": "ba669671", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "I get a warning from `pandas` telling me that I may have a problem. Since it cannot be sure that this is a problem for me, it issues a mere warning rather than an error.\n", "\n", "What it is actually telling me is: if I expected to modify `df` through `df1`, I am mistaken, because `df1` is a copy of my dataframe `df`, so `df` will not be modified.\n", "\n", "Maybe that is what you want (for `df1` to be a copy)! But then why not say so clearly by making an explicit copy?" ] }, { "cell_type": "markdown", "id": "500aff95", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "If my intent really was to modify only `df1`, because I want a copy of `df`, then I code it **properly**:" ] }, { "cell_type": "code", "execution_count": 42, "id": "dab6ea7c", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Survived Pclass Sex\n", "PassengerId \n", "804 1 3 male\n", "756 1 2 male\n", "645 1 2 female\n", "470 1 3 female\n", "79 1 2 male" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df[ ['Survived', 'Pclass', 'Sex'] ].copy()\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": 43, "id": "ca06e724", "metadata": { "tags": [ "level_intermediate" ] }, "outputs": [], "source": [ "df2.loc[1, 'Survived'] = 1" ] }, { "cell_type": "markdown", "id": "a385bf30", "metadata": { "tags": [ "level_intermediate" ] }, "source": [ "Ah, that's better!" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "all,-hidden,-heading_collapsed", "notebook_metadata_filter": "all,-language_info,-toc,-jupytext.text_representation.jupytext_version,-jupytext.text_representation.format_version", "text_representation": { "extension": ".md", "format_name": "myst" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" }, "notebookname": "slicing et dataframe", "source_map": [ 16, 22, 26, 30, 35, 39, 42, 46, 49, 53, 57, 60, 64, 68, 73, 81, 83, 87, 93, 101, 106, 110, 116, 124, 128, 132, 139, 143, 151, 158, 160, 164, 168, 176, 180, 185, 189, 193, 197, 207, 215, 222, 227, 231, 235, 239, 253, 257, 261, 265, 271, 275, 296, 300, 304, 313, 322, 325, 329, 333, 341, 346, 352, 359, 363, 370, 374, 380, 387, 399, 404, 406, 410, 417, 424, 428, 436, 440, 448, 456, 460, 470, 475, 482, 486, 490, 494, 498, 502, 508, 512, 516, 523, 535, 539, 543, 547, 554, 560, 568, 572, 578, 582, 588, 596, 600, 607, 613 ], "version": "1.0" }, "nbformat": 4, "nbformat_minor": 5 }