Handling time series data with pandas

Converting a column to datetime:

We want to work with the dates in the date.utc column as datetime objects rather than plain text.

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv')
# Before conversion the column holds plain strings (dtype: object)
print(air_quality["date.utc"].head())
# Convert the strings to timezone-aware datetime values
air_quality["date.utc"] = pd.to_datetime(air_quality["date.utc"])
print(air_quality["date.utc"].head())

Output:

0    2019-06-21 00:00:00+00:00
1    2019-06-20 23:00:00+00:00
2    2019-06-20 22:00:00+00:00
3    2019-06-20 21:00:00+00:00
4    2019-06-20 20:00:00+00:00
Name: date.utc, dtype: object
0   2019-06-21 00:00:00+00:00
1   2019-06-20 23:00:00+00:00
2   2019-06-20 22:00:00+00:00
3   2019-06-20 21:00:00+00:00
4   2019-06-20 20:00:00+00:00
Name: date.utc, dtype: datetime64[ns, UTC]

Alternatively, pass parse_dates to read_csv so the column is converted while the file is loaded:

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv', parse_dates=["date.utc"])
print(air_quality["date.utc"].head())

Output:

0   2019-06-21 00:00:00+00:00
1   2019-06-20 23:00:00+00:00
2   2019-06-20 22:00:00+00:00
3   2019-06-20 21:00:00+00:00
4   2019-06-20 20:00:00+00:00
Name: date.utc, dtype: datetime64[ns, UTC]
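
To check that the two approaches above give the same result without downloading the file, here is a minimal sketch. It assumes a small in-memory CSV (five rows in the shape of the air-quality file) as a stand-in for the remote data; the names csv_text, converted and parsed_on_load are only illustrative.

```python
import io

import pandas as pd

# Illustrative stand-in for the remote CSV: five rows in the same layout.
csv_text = """city,country,date.utc,location,parameter,value,unit
Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³
Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³
Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³
Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³
Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³
"""

# Approach 1: load the column as text, then convert it.
converted = pd.read_csv(io.StringIO(csv_text))
converted["date.utc"] = pd.to_datetime(converted["date.utc"])

# Approach 2: let read_csv parse the column while loading.
parsed_on_load = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])

# Both columns should hold the same timezone-aware datetime values.
same_values = converted["date.utc"].equals(parsed_on_load["date.utc"])
print(same_values)
print(parsed_on_load["date.utc"].dtype)
```

Which one to use is mostly a matter of taste: parse_dates is shorter, while to_datetime is handy when the column needs cleaning before conversion.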

Earliest and latest timestamps (min and max):

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv', parse_dates=["date.utc"])
air_quality["date.utc"].min(), air_quality["date.utc"].max()

Output:

(Timestamp('2019-05-07 01:00:00+0000', tz='UTC'),
 Timestamp('2019-06-21 00:00:00+0000', tz='UTC'))

Subtraction (length of the time series):

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv', parse_dates=["date.utc"])
air_quality["date.utc"].max() - air_quality["date.utc"].min()

Output:

Timedelta('44 days 23:00:00')
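
The result of the subtraction is a pandas Timedelta, which can be broken down further. The sketch below rebuilds the two timestamps from the min/max output above instead of reloading the file; the variable names start, end and span are only illustrative.

```python
import pandas as pd

# Start and end timestamps as reported by min() and max() above.
start = pd.Timestamp("2019-05-07 01:00:00", tz="UTC")
end = pd.Timestamp("2019-06-21 00:00:00", tz="UTC")

span = end - start  # subtracting two Timestamps gives a Timedelta

print(span)                         # the full duration
print(span.days)                    # whole days only
print(span.total_seconds() / 3600)  # the full duration expressed in hours
```

Note that days drops the remaining hours, so total_seconds() is the safer choice when the exact length matters.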

Grouping by period (here, extracting the month into a new column):

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv', parse_dates=["date.utc"])
air_quality["month"] = air_quality["date.utc"].dt.month
air_quality.head()

Output:

	city	country	date.utc	location	parameter	value	unit	month
0	Paris	FR	2019-06-21 00:00:00+00:00	FR04014	no2	20.0	µg/m³	6
1	Paris	FR	2019-06-20 23:00:00+00:00	FR04014	no2	21.8	µg/m³	6
2	Paris	FR	2019-06-20 22:00:00+00:00	FR04014	no2	26.5	µg/m³	6
3	Paris	FR	2019-06-20 21:00:00+00:00	FR04014	no2	24.9	µg/m³	6
4	Paris	FR	2019-06-20 20:00:00+00:00	FR04014	no2	21.4	µg/m³	6
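
The month is only one of the components available through the dt accessor. As a small sketch, using three timestamps copied from the rows above rather than the full file (the names dates and components are only illustrative):

```python
import pandas as pd

# Three timestamps taken from the first rows of the dataset.
dates = pd.Series(pd.to_datetime([
    "2019-06-21 00:00:00+00:00",
    "2019-06-20 23:00:00+00:00",
    "2019-06-20 22:00:00+00:00",
]))

# Each dt attribute returns one value per row.
components = pd.DataFrame({
    "month": dates.dt.month,
    "day": dates.dt.day,
    "hour": dates.dt.hour,
})
print(components)
```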

You can also group by other components, such as the hour ("hour") or the day of the week ("weekday", equivalent to "dayofweek", with Monday = 0). Here, the mean value for each weekday and location:

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv', parse_dates=["date.utc"])
air_quality.groupby([air_quality["date.utc"].dt.weekday, "location"])["value"].mean()

Output:

date.utc  location          
0         BETR801               27.875000
          FR04014               24.856250
          London Westminster    23.969697
1         BETR801               22.214286
          FR04014               30.999359
          London Westminster    24.885714
2         BETR801               21.125000
          FR04014               29.165753
          London Westminster    23.460432
3         BETR801               27.500000
          FR04014               28.600690
          London Westminster    24.780142
4         BETR801               28.400000
          FR04014               31.617986
          London Westminster    26.446809
5         BETR801               33.500000
          FR04014               25.266154
          London Westminster    24.977612
6         BETR801               21.896552
          FR04014               23.274306
          London Westminster    24.859155
Name: value, dtype: float64
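
To see how the weekday grouping works on something small enough to check by hand, here is a sketch that assumes only the five Paris rows shown earlier (a single location, so it is not representative of the full result above). The names sample and mean_by_weekday are illustrative.

```python
import pandas as pd

# The five FR04014 rows shown earlier: one on Friday 21 June, four on Thursday 20 June.
sample = pd.DataFrame({
    "date.utc": pd.to_datetime([
        "2019-06-21 00:00:00+00:00",
        "2019-06-20 23:00:00+00:00",
        "2019-06-20 22:00:00+00:00",
        "2019-06-20 21:00:00+00:00",
        "2019-06-20 20:00:00+00:00",
    ]),
    "location": ["FR04014"] * 5,
    "value": [20.0, 21.8, 26.5, 24.9, 21.4],
})

# weekday numbers days from Monday=0, so Thursday is 3 and Friday is 4.
weekday = sample["date.utc"].dt.weekday
mean_by_weekday = sample.groupby([weekday, "location"])["value"].mean()
print(mean_by_weekday)
```

The Thursday group averages the four values 21.8, 26.5, 24.9 and 21.4, while the Friday group contains the single value 20.0.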

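Grouping with resample:

Monthly maximum

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv', parse_dates=["date.utc"], index_col="date.utc")
# "M" bins by calendar month, labelled with the month-end date (pandas 2.2 and later prefer the alias "ME")
# max() is applied to every column; for text columns it returns the alphabetically last value
monthly_max = air_quality.resample("M").max()
monthly_max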

Output:

	city	country	location	parameter	value	unit
date.utc						
2019-05-31 00:00:00+00:00	Paris	GB	London Westminster	no2	97.0	µg/m³
2019-06-30 00:00:00+00:00	Paris	GB	London Westminster	no2	84.7	µg/m³
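
resample needs a datetime index, which is why date.utc is passed as index_col above. The sketch below shows the same idea on the five rows seen earlier; it assumes daily bins ("D") instead of monthly ones, because the sample only spans two days. The names sample and daily_max are illustrative.

```python
import pandas as pd

# The five hourly values shown earlier, indexed by their timestamps.
sample = pd.DataFrame(
    {"value": [20.0, 21.8, 26.5, 24.9, 21.4]},
    index=pd.to_datetime([
        "2019-06-21 00:00:00+00:00",
        "2019-06-20 23:00:00+00:00",
        "2019-06-20 22:00:00+00:00",
        "2019-06-20 21:00:00+00:00",
        "2019-06-20 20:00:00+00:00",
    ]),
)
sample.index.name = "date.utc"

# Sort chronologically, then take the maximum value within each calendar day.
daily_max = sample.sort_index()["value"].resample("D").max()
print(daily_max)
```

Selecting the value column before resampling keeps the result numeric and avoids aggregating text columns.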

Weekly mean

import pandas as pd
air_quality = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv', parse_dates=["date.utc"], index_col="date.utc")
# numeric_only=True skips the text columns, which cannot be averaged (pandas 2.x raises a TypeError otherwise)
weekly_mean = air_quality.resample("W").mean(numeric_only=True)
weekly_mean
