Introducción a pandas

Contenido

Introducción a pandas#

Pandas es un paquete de Python que proporciona estructuras de datos similares a los dataframes de R. Pandas depende de Numpy, la librería que añade un potente tipo matricial a Python. Los principales tipos de datos que pueden representarse con pandas son:

Datos tabulares con columnas de tipo heterogéneo con etiquetas en columnas y filas.

Series temporales.

Pandas proporciona herramientas que permiten:

Leer y escribir datos en diferentes formatos: CSV, Microsoft Excel, bases SQL y formato HDF5

Seleccionar y filtrar de manera sencilla tablas de datos en función de posición, valor o etiquetas

Fusionar y unir datos

Transformar datos aplicando funciones tanto en global como por ventanas

Manipulación de series temporales

Hacer gráficas

En pandas existen tres tipos básicos de objetos todos ellos basados a su vez en Numpy:

Series (listas, 1D), DataFrame (tablas, 2D) y Panels (tablas 3D).

Revisar otros tutoriales:

https://pandas.pydata.org/pandas-docs/stable/tutorials.html

https://bioinf.comav.upv.es/courses/linux/python/pandas.html

Importar librerias#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Lectura del dataset#

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

# Lectura de un dataset (CSV), y almancenamiento en la variable "d"
d = pd.read_csv("/content/drive/MyDrive/clase 6 partes/sampledata.csv")

Comandos y operaciones principales de Pandas#

# Visualización de los primeros y últimos 5 elementos
d

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score
0	1	Sandra	Sims	ssims0@microsoft.com	Armenia	63.84.115.63	44.43054	40.14493	44	0.62
1	2	Anna	Bishop	abishop1@mtv.com	China	204.108.246.11	118.29169	29.67594	90	0.22
2	3	Virginia	Rodriguez	vrodriguez2@so-net.ne.jp	Portugal	88.65.157.43	-8.63330	41.40000	39	0.33
3	4	Julia	Stanley	jstanley3@abc.net.au	China	76.197.170.103	102.19379	38.50062	NaN	0.64
4	5	Jacqueline	Gutierrez	jgutierrez4@shinystat.com	Poland	159.13.71.38	18.54003	54.17062	14	0.50
...	...	...	...	...	...	...	...	...	...	...
995	996	Jean	Payne	jpaynern@bluehost.com	Mexico	191.211.230.212	-92.89380	16.21790	40	0.30
996	997	Michelle	Murphy	mmurphyro@senate.gov	Croatia	129.124.79.138	15.85000	45.78333	87	0.54
997	998	Joseph	Hill	jhillrp@princeton.edu	Japan	161.176.64.135	139.80000	36.41667	50	0.57
998	999	Roger	Harrison	rharrisonrq@gizmodo.com	Poland	7.194.73.231	20.37415	49.84485	18	0.37
999	1000	Kelly	Henry	khenryrr@squidoo.com	Indonesia	145.118.102.45	111.45570	-7.21950	83	0.39

1000 rows × 10 columns

# Ver el tipo de variable de los dataframes de Pandas
type(d)

pandas.core.frame.DataFrame
def __init__(data=None, index: Axes | None=None, columns: Axes | None=None, dtype: Dtype | None=None, copy: bool | None=None) -> None

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.pyTwo-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    data is a dict, column order follows insertion-order. If a dict contains Series
    which have an index defined, it is aligned by its index.

    .. versionchanged:: 0.25.0
       If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame when data does not have them,
    defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    will perform column selection instead.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool or None, default None
    Copy data from inputs.
    For dict data, the default of None behaves like ``copy=True``.  For DataFrame
    or 2d ndarray input, the default of None behaves like ``copy=False``.
    If data is a dict containing one or more Series (possibly of different dtypes),
    ``copy=False`` will ensure that these inputs are not copied.

    .. versionchanged:: 1.3.0

See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.

Notes
-----
Please reference the :ref:`User Guide <basics.dataframe>` for more information.

Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from a dictionary including Series:

>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3

# Con iloc se puede extraer sólo partes particulares deseadas del dataframe
d.iloc[1:10,3:4]

	email
1	abishop1@mtv.com
2	vrodriguez2@so-net.ne.jp
3	jstanley3@abc.net.au
4	jgutierrez4@shinystat.com
5	nlittle5@latimes.com
6	tfox6@squarespace.com
7	jparker7@gnu.org
8	kwalker8@nba.com
9	wbrown9@boston.com

# Mostrar estadísticas básicas para las variables numéricas
d.describe()

	id	longitude	latitude	score
count	1000.000000	999.000000	999.000000	1000.000000
mean	500.500000	46.831119	24.273880	0.501290
std	288.819436	70.556083	24.272596	0.103403
min	1.000000	-149.333330	-53.787690	0.180000
25%	250.750000	7.046050	8.135695	0.430000
50%	500.500000	42.174440	29.819050	0.510000
75%	750.250000	112.349640	43.186925	0.570000
max	1000.000000	175.496340	72.786840	0.830000

# Acceder a un sólo elemento del dataframe
d.iloc[2,2]

'Rodriguez'

# Acceder a registros seleccionados con todas las columnas
d.iloc[[1,3,5],]

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score
1	2	Anna	Bishop	abishop1@mtv.com	China	204.108.246.11	118.29169	29.67594	90	0.22
3	4	Julia	Stanley	jstanley3@abc.net.au	China	76.197.170.103	102.19379	38.50062	NaN	0.64
5	6	Nicole	Little	nlittle5@latimes.com	Panama	235.246.1.116	-78.13774	8.40693	10	0.30

# Seleccionar los primeros diez registros, luego seleccionar solamente 3 columnas
d.iloc[range(10),][["last_name", "country", "score"]]

	last_name	country	score
0	Sims	Armenia	0.62
1	Bishop	China	0.22
2	Rodriguez	Portugal	0.33
3	Stanley	China	0.64
4	Gutierrez	Poland	0.50
5	Little	Panama	0.30
6	Fox	Iran	0.33
7	Parker	China	0.55
8	Walker	Slovenia	0.50
9	Brown	Ethiopia	0.41

# Filtrar el dataset para sólo obtener edades iguales a 44 años
d[d.age=='44']

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score
0	1	Sandra	Sims	ssims0@microsoft.com	Armenia	63.84.115.63	44.43054	40.14493	44	0.62
169	170	Jason	Stevens	jstevens4p@foxnews.com	Czech Republic	105.101.112.178	16.11480	50.26742	44	0.54
188	189	Ruby	Gonzales	rgonzales58@pbs.org	Portugal	122.247.105.164	-8.91670	38.70000	44	0.58
251	252	Deborah	Patterson	dpatterson6z@salon.com	Palau	177.16.160.109	134.71725	8.08228	44	0.61
337	338	Todd	Henderson	thenderson9d@latimes.com	Indonesia	4.45.250.25	113.48740	-8.28670	44	0.53
473	474	Nicholas	Scott	nscottd5@samsung.com	Russia	165.111.40.207	104.24940	56.70320	44	0.37
550	551	Kathy	Burke	kburkefa@pen.io	Indonesia	197.17.149.125	115.11730	-8.46090	44	0.73
706	707	Kathryn	Mitchell	kmitchelljm@boston.com	Azerbaijan	113.1.57.57	50.04002	40.56441	44	0.66
723	724	Aaron	Chapman	achapmank3@mayoclinic.com	Indonesia	59.111.140.158	106.02910	-6.74290	44	0.39
843	844	Diana	Bennett	dbennettnf@wikipedia.org	Cuba	147.157.224.46	-74.15181	20.24673	44	0.46

# Ver el tipo de datos almacenados en una columna: dtype('O') para Objetos o Strings
d['age'].dtype

dtype('O')

# Filtrar el dataset para obtener los registros cuando las edades son de 55 o 44 años
d[np.logical_or(d.age=="55", d.age=="44")]

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score
0	1	Sandra	Sims	ssims0@microsoft.com	Armenia	63.84.115.63	44.43054	40.14493	44	0.62
18	19	Jimmy	Robertson	jrobertsoni@thetimes.co.uk	Tajikistan	184.90.118.60	68.44668	37.56702	55	0.60
72	73	Michael	Powell	mpowell20@youtube.com	China	232.233.196.224	120.21400	27.32734	55	0.59
169	170	Jason	Stevens	jstevens4p@foxnews.com	Czech Republic	105.101.112.178	16.11480	50.26742	44	0.54
188	189	Ruby	Gonzales	rgonzales58@pbs.org	Portugal	122.247.105.164	-8.91670	38.70000	44	0.58
251	252	Deborah	Patterson	dpatterson6z@salon.com	Palau	177.16.160.109	134.71725	8.08228	44	0.61
266	267	Gerald	Butler	gbutler7e@google.fr	Philippines	149.199.214.71	121.41470	14.12840	55	0.55
337	338	Todd	Henderson	thenderson9d@latimes.com	Indonesia	4.45.250.25	113.48740	-8.28670	44	0.53
367	368	Philip	Harvey	pharveya7@wsj.com	Portugal	188.30.85.41	-8.16630	37.86690	55	0.18
381	382	Joan	Oliver	joliveral@chicagotribune.com	Russia	98.144.120.116	38.40172	54.03481	55	0.37
418	419	Paula	Fowler	pfowlerbm@apache.org	Philippines	221.234.139.145	120.65190	17.56470	55	0.43
473	474	Nicholas	Scott	nscottd5@samsung.com	Russia	165.111.40.207	104.24940	56.70320	44	0.37
550	551	Kathy	Burke	kburkefa@pen.io	Indonesia	197.17.149.125	115.11730	-8.46090	44	0.73
685	686	Jesse	Freeman	jfreemanj1@wikia.com	Iran	66.31.71.144	60.21430	26.22580	55	0.32
706	707	Kathryn	Mitchell	kmitchelljm@boston.com	Azerbaijan	113.1.57.57	50.04002	40.56441	44	0.66
723	724	Aaron	Chapman	achapmank3@mayoclinic.com	Indonesia	59.111.140.158	106.02910	-6.74290	44	0.39
843	844	Diana	Bennett	dbennettnf@wikipedia.org	Cuba	147.157.224.46	-74.15181	20.24673	44	0.46
853	854	Willie	Ross	wrossnp@networkadvertising.org	Indonesia	90.176.105.5	125.18950	1.45697	55	0.53
897	898	Julie	Mason	jmasonox@wikia.com	Libya	229.225.235.75	12.72778	32.75222	55	0.46

# Seleccionar un conjunto de registros del 0 al 9
d.iloc[range(10),]

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score
0	1	Sandra	Sims	ssims0@microsoft.com	Armenia	63.84.115.63	44.43054	40.14493	44	0.62
1	2	Anna	Bishop	abishop1@mtv.com	China	204.108.246.11	118.29169	29.67594	90	0.22
2	3	Virginia	Rodriguez	vrodriguez2@so-net.ne.jp	Portugal	88.65.157.43	-8.63330	41.40000	39	0.33
3	4	Julia	Stanley	jstanley3@abc.net.au	China	76.197.170.103	102.19379	38.50062	NaN	0.64
4	5	Jacqueline	Gutierrez	jgutierrez4@shinystat.com	Poland	159.13.71.38	18.54003	54.17062	14	0.50
5	6	Nicole	Little	nlittle5@latimes.com	Panama	235.246.1.116	-78.13774	8.40693	10	0.30
6	7	Terry	Fox	tfox6@squarespace.com	Iran	81.219.197.208	55.49164	37.38071	89	0.33
7	8	Jeremy	Parker	jparker7@gnu.org	China	17.207.110.163	110.71266	22.01041	60	0.55
8	9	Kenneth	Walker	kwalker8@nba.com	Slovenia	117.248.31.17	13.97028	45.68472	no disponible	0.50
9	10	Willie	Brown	wbrown9@boston.com	Ethiopia	253.51.12.171	42.80000	9.35000	17	0.41

# Verificar que el tipo de variable sigue siendo dataframe
type(d.iloc[range(10),])

pandas.core.frame.DataFrame

# Obtener sólo una columna como una Serie
d.age

     44
     90
     39
    NaN
     14
      ... 
   40
   87
   50
   18
   83
Name: age, Length: 1000, dtype: object

# Ver tipo de dato, ahora ya no es dataframe, sinó una serie
type(d.age)

pandas.core.series.Series

d["score"].dtype, d['age'].dtype

(dtype('float64'), dtype('O'))

# Crear una nueva columna llamada "scoretimes2", que será la columna score multiplicada por 2
d["scoretimes2"] = d["score"]*2

# Visualizar los primeros 10 elementos, ahora con la nueva columna al final (scoretimes2)
d.head(10)

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score	scoretimes2
0	1	Sandra	Sims	ssims0@microsoft.com	Armenia	63.84.115.63	44.43054	40.14493	44	0.62	1.24
1	2	Anna	Bishop	abishop1@mtv.com	China	204.108.246.11	118.29169	29.67594	90	0.22	0.44
2	3	Virginia	Rodriguez	vrodriguez2@so-net.ne.jp	Portugal	88.65.157.43	-8.63330	41.40000	39	0.33	0.66
3	4	Julia	Stanley	jstanley3@abc.net.au	China	76.197.170.103	102.19379	38.50062	NaN	0.64	1.28
4	5	Jacqueline	Gutierrez	jgutierrez4@shinystat.com	Poland	159.13.71.38	18.54003	54.17062	14	0.50	1.00
5	6	Nicole	Little	nlittle5@latimes.com	Panama	235.246.1.116	-78.13774	8.40693	10	0.30	0.60
6	7	Terry	Fox	tfox6@squarespace.com	Iran	81.219.197.208	55.49164	37.38071	89	0.33	0.66
7	8	Jeremy	Parker	jparker7@gnu.org	China	17.207.110.163	110.71266	22.01041	60	0.55	1.10
8	9	Kenneth	Walker	kwalker8@nba.com	Slovenia	117.248.31.17	13.97028	45.68472	no disponible	0.50	1.00
9	10	Willie	Brown	wbrown9@boston.com	Ethiopia	253.51.12.171	42.80000	9.35000	17	0.41	0.82

#Visualizar los últimos 10 elementos, ahora con la nueva columna al final (scoretimes2)
d.tail(10)

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score	scoretimes2
990	991	Sara	Ramirez	sramirezri@earthlink.net	Sweden	228.154.219.24	18.06490	59.33260	79	0.38	0.76
991	992	James	Dunn	jdunnrj@fastcompany.com	China	51.111.161.164	111.54951	29.01573	41	0.60	1.20
992	993	Kathryn	Dunn	kdunnrk@answers.com	Peru	9.146.142.141	-76.70944	-11.94306	53	0.41	0.82
993	994	Joan	Richards	jrichardsrl@phoca.cz	Portugal	168.177.31.140	-8.30600	41.55710	28	0.46	0.92
994	995	Anthony	Richards	arichardsrm@arizona.edu	Indonesia	111.225.64.205	121.88890	-10.59710	24	0.36	0.72
995	996	Jean	Payne	jpaynern@bluehost.com	Mexico	191.211.230.212	-92.89380	16.21790	40	0.30	0.60
996	997	Michelle	Murphy	mmurphyro@senate.gov	Croatia	129.124.79.138	15.85000	45.78333	87	0.54	1.08
997	998	Joseph	Hill	jhillrp@princeton.edu	Japan	161.176.64.135	139.80000	36.41667	50	0.57	1.14
998	999	Roger	Harrison	rharrisonrq@gizmodo.com	Poland	7.194.73.231	20.37415	49.84485	18	0.37	0.74
999	1000	Kelly	Henry	khenryrr@squidoo.com	Indonesia	145.118.102.45	111.45570	-7.21950	83	0.39	0.78

# Información condensada del dataframe
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           1000 non-null   int64  
 1   first_name   1000 non-null   object 
 2   last_name    1000 non-null   object 
 3   email        1000 non-null   object 
 4   country      1000 non-null   object 
 5   ip_address   1000 non-null   object 
 6   longitude    999 non-null    float64
 7   latitude     999 non-null    float64
 8   age          999 non-null    object 
 9   score        1000 non-null   float64
 10  scoretimes2  1000 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 86.1+ KB

# Obtener los nombres de las columnas
d.columns

Index(['id', 'first_name', 'last_name', 'email', 'country', 'ip_address',
       'longitude', 'latitude', 'age', 'score', 'scoretimes2'],
      dtype='object')

# Ver las estadísticas básicas, ahora se puede ver la nueva columna
d.describe()

	id	longitude	latitude	score	scoretimes2
count	1000.000000	999.000000	999.000000	1000.000000	1000.000000
mean	500.500000	46.831119	24.273880	0.501290	1.002580
std	288.819436	70.556083	24.272596	0.103403	0.206805
min	1.000000	-149.333330	-53.787690	0.180000	0.360000
25%	250.750000	7.046050	8.135695	0.430000	0.860000
50%	500.500000	42.174440	29.819050	0.510000	1.020000
75%	750.250000	112.349640	43.186925	0.570000	1.140000
max	1000.000000	175.496340	72.786840	0.830000	1.660000

# Generar la matriz de correlaciones, se realizará automáticamente sobre las columnas numéricas
d.corr()

	id	longitude	latitude	score	scoretimes2
id	1.000000	0.003941	-0.045643	-0.007075	-0.007075
longitude	0.003941	1.000000	-0.022150	0.004596	0.004596
latitude	-0.045643	-0.022150	1.000000	-0.016692	-0.016692
score	-0.007075	0.004596	-0.016692	1.000000	1.000000
scoretimes2	-0.007075	0.004596	-0.016692	1.000000	1.000000

# Crear una fila con datos
row = pd.DataFrame([[-21, "Pepe", "Sonora", "ps@ps.com", "Colombia", None, -72.09, 5.3, 25, -0.98, 0.00001]],
                    columns=d.columns)

# Visualizar la fila con datos
row

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score	scoretimes2
0	-21	Pepe	Sonora	ps@ps.com	Colombia	None	-72.09	5.3	25	-0.98	0.00001

# Utilizando append se puede agregar al final la fila que se creó, al dataframe completo
d.append(row)

<ipython-input-20-e9b353430c8a>:2: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  d.append(row)

	id	first_name	last_name	email	country	ip_address	longitude	latitude	age	score	scoretimes2
0	1	Sandra	Sims	ssims0@microsoft.com	Armenia	63.84.115.63	44.43054	40.14493	44	0.62	1.24000
1	2	Anna	Bishop	abishop1@mtv.com	China	204.108.246.11	118.29169	29.67594	90	0.22	0.44000
2	3	Virginia	Rodriguez	vrodriguez2@so-net.ne.jp	Portugal	88.65.157.43	-8.63330	41.40000	39	0.33	0.66000
3	4	Julia	Stanley	jstanley3@abc.net.au	China	76.197.170.103	102.19379	38.50062	NaN	0.64	1.28000
4	5	Jacqueline	Gutierrez	jgutierrez4@shinystat.com	Poland	159.13.71.38	18.54003	54.17062	14	0.50	1.00000
...	...	...	...	...	...	...	...	...	...	...	...
996	997	Michelle	Murphy	mmurphyro@senate.gov	Croatia	129.124.79.138	15.85000	45.78333	87	0.54	1.08000
997	998	Joseph	Hill	jhillrp@princeton.edu	Japan	161.176.64.135	139.80000	36.41667	50	0.57	1.14000
998	999	Roger	Harrison	rharrisonrq@gizmodo.com	Poland	7.194.73.231	20.37415	49.84485	18	0.37	0.74000
999	1000	Kelly	Henry	khenryrr@squidoo.com	Indonesia	145.118.102.45	111.45570	-7.21950	83	0.39	0.78000
0	-21	Pepe	Sonora	ps@ps.com	Colombia	None	-72.09000	5.30000	25	-0.98	0.00001

1001 rows × 11 columns

Si quieres agregar columnas tambien puedes utilizar metodos como .join() o .concat()
Recuerda que para agregar tanto Filas como Columnas ambas deben contar con la dimensión de longitud del Dataframe de lo contrario generará un error del tipo ValueError

# Crear una fila con datos
row2 = pd.DataFrame([[-21, "Pepe", "Sonora", "ps@ps.com", "Colombia", None, -72.09, 5.3, 25, -0.98]],
                    columns=d.columns)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _finalize_columns_and_data(content, columns, dtype)
    968     try:
--> 969         columns = _validate_or_indexify_columns(contents, columns)
    970     except AssertionError as err:

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _validate_or_indexify_columns(content, columns)
   1016             # caller's responsibility to check for this...
-> 1017             raise AssertionError(
   1018                 f"{len(columns)} columns passed, passed data had "

AssertionError: 11 columns passed, passed data had 10 columns

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-21-0f3c75f68135> in <cell line: 2>()
      1 # Crear una fila con datos
----> 2 row2 = pd.DataFrame([[-21, "Pepe", "Sonora", "ps@ps.com", "Colombia", None, -72.09, 5.3, 25, -0.98]],
      3                     columns=d.columns)

/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    744                     if columns is not None:
    745                         columns = ensure_index(columns)
--> 746                     arrays, columns, index = nested_data_to_arrays(
    747                         # error: Argument 3 to "nested_data_to_arrays" has incompatible
    748                         # type "Optional[Collection[Any]]"; expected "Optional[Index]"

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in nested_data_to_arrays(data, columns, index, dtype)
    508         columns = ensure_index(data[0]._fields)
    509 
--> 510     arrays, columns = to_arrays(data, columns, dtype=dtype)
    511     columns = ensure_index(columns)
    512 

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in to_arrays(data, columns, dtype)
    873         arr = _list_to_arrays(data)
    874 
--> 875     content, columns = _finalize_columns_and_data(arr, columns, dtype)
    876     return content, columns
    877 

/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _finalize_columns_and_data(content, columns, dtype)
    970     except AssertionError as err:
    971         # GH#26429 do not raise user-facing AssertionError
--> 972         raise ValueError(err) from err
    973 
    974     if len(contents) and contents[0].dtype == np.object_:

ValueError: 11 columns passed, passed data had 10 columns

Carga de un nuevo dataset#

# Podrás cargar distintos dataset en nuevas variables
d = pd.read_csv("/content/drive/MyDrive/clase 6 partes/comptagevelo2009.csv")

Inspección del nuevo dataset#

# Visualizar los primeros 20 registros
d.head(20)

	Date	Unnamed: 1	Berri1	Maisonneuve_1	Maisonneuve_2	Brébeuf
0	01/01/2009	00:00	29	20	35	NaN
1	02/01/2009	00:00	19	3	22	NaN
2	03/01/2009	00:00	24	12	22	NaN
3	04/01/2009	00:00	24	8	15	NaN
4	05/01/2009	00:00	120	111	141	NaN
5	06/01/2009	00:00	261	146	236	NaN
6	07/01/2009	00:00	60	33	80	NaN
7	08/01/2009	00:00	24	14	14	NaN
8	09/01/2009	00:00	35	20	32	NaN
9	10/01/2009	00:00	81	45	79	NaN
10	11/01/2009	00:00	318	160	306	NaN
11	12/01/2009	00:00	105	99	170	NaN
12	13/01/2009	00:00	168	94	172	NaN
13	14/01/2009	00:00	145	87	175	NaN
14	15/01/2009	00:00	131	90	172	NaN
15	16/01/2009	00:00	93	49	97	NaN
16	17/01/2009	00:00	25	13	20	NaN
17	18/01/2009	00:00	52	22	55	NaN
18	19/01/2009	00:00	136	127	202	NaN
19	20/01/2009	00:00	147	85	176	NaN

# Ver los nombres de las columnas, y el tamaño (365 filas, 6 columnas)
d.columns, d.shape

(Index(['Date', 'Unnamed: 1', 'Berri1', 'Maisonneuve_1', 'Maisonneuve_2',
        'Brébeuf'],
       dtype='object'),
 (365, 6))

# Ver estadísticas básicas
d.describe()

	Berri1	Maisonneuve_1	Maisonneuve_2	Brébeuf
count	365.000000	365.000000	365.000000	178.000000
mean	2032.200000	1060.252055	2093.169863	2576.359551
std	1878.879799	1079.533086	1854.368523	2484.004743
min	0.000000	0.000000	0.000000	0.000000
25%	194.000000	90.000000	228.000000	0.000000
50%	1726.000000	678.000000	1686.000000	1443.500000
75%	3540.000000	1882.000000	3520.000000	4638.000000
max	6626.000000	4242.000000	6587.000000	7575.000000

# Extraer una columna como una serie, y mostrar los primeros 10 elementos
d["Berri1"].head(10)

   29
   19
   24
   24
  120
  261
   60
   24
   35
   81
Name: Berri1, dtype: int64

# La matriz completa como dataframe completo, y una sóla columna como una Serie
type(d), type(d["Berri1"])

(pandas.core.frame.DataFrame, pandas.core.series.Series)

# Con Unique se puede ver los Diferentes elementos, en la columna "Unnamed: 1" sólo hay valores "00:00"
d["Unnamed: 1"].unique()

array(['00:00'], dtype=object)

# Ver los diferentes elementos para Brébeuf, hay tanto valores nan como otros distintos
d["Brébeuf"].unique()

array([  nan, 2156., 4548., 7575., 7268., 2320., 5081., 4236., 3755.,
       6939., 4278., 5999., 3586., 4132., 7064., 6996., 4331., 6134.,
       4639., 4635., 2144., 5041., 7219., 7208., 5237., 5869., 1753.,
       7194., 5384., 7121., 5259., 5103., 5514., 3524., 4187., 4359.,
       7127., 6971., 6178., 5078., 4416., 5683., 5647., 6901., 4902.,
       3752., 4822., 4015., 5534., 7044., 6242., 6239., 5661., 1445.,
       3064., 5941., 7052., 6864., 5837., 4185., 4036., 4375., 6763.,
       6686., 6520., 6044., 5205., 3959., 4263., 4379., 4256., 3985.,
       2169., 2550., 2605., 3599., 1923., 2044., 3183., 2709., 2547.,
        656., 1415., 1342., 1799., 1588., 1906.,  705., 1116., 1773.,
       2211.,  984., 1967.,  663., 1003.,  825.,  924.,  870., 1373.,
       1311., 1238.,  919., 1006., 1483., 1374., 1442.,  633.,  952.,
        286.,  833., 1191., 1114., 1075., 1230.,  958.,  309.,  859.,
       1208.,  798., 1046.,  906.,  864.,  552., 1063., 1341., 1300.,
       1175., 1108., 1005.,  321.,  404.,    0.])

# Ver los diferentes elementos en la columna Berri1
d["Berri1"].unique()

array([  29,   19,   24,  120,  261,   60,   35,   81,  318,  105,  168,
        145,  131,   93,   25,   52,  136,  147,  109,  172,  148,   15,
        209,   92,  110,   14,  158,  179,  122,   95,  185,   82,  190,
        228,  306,  188,   98,  139,  258,  304,  326,  134,  125,   96,
         65,  123,  129,  154,  239,  198,   32,   67,  157,  164,  300,
        176,  195,  310,    7,  366,  234,  132,  203,  298,  541,  525,
        871,  592,  455,  446,  441,  266,  189,  343,  292,  355,  245,
          0,  445, 1286, 1178, 2131, 2709,  752, 1886, 2069, 3132, 3668,
       1368, 4051, 2286, 3519, 3520, 1925, 2125, 2662, 4403, 4338, 2757,
        970, 2767, 1493,  728, 3982, 4742, 5278, 2344, 4094,  784, 1048,
       2442, 3686, 3042, 5728, 3815, 3540, 4775, 4434, 4363, 2075, 2338,
       1387, 2063, 2031, 3274, 4325, 5430, 6028, 3876, 2742, 4973, 1125,
       3460, 4449, 3576, 4027, 4313, 3182, 5668, 6320, 2397, 2857, 2590,
       3234, 5138, 5799, 4911, 4333, 3680, 1536, 3064, 1004, 4709, 4471,
       4432, 2997, 2544, 5121, 3862, 3036, 3744, 6626, 6274, 1876, 4393,
       3471, 3537, 6100, 3489, 4859, 2991, 3588, 5607, 5754, 3440, 5124,
       4054, 4372, 1801, 4088, 5891, 3754, 5267, 3146,   63,   77, 5904,
       4417, 5611, 4197, 4265, 4589, 2775, 2999, 3504, 5538, 5386, 3916,
       3307, 4382, 5327, 3796, 2832, 3492, 2888, 4120, 5450, 4722, 4707,
       4439, 2277, 4572, 5298, 5451, 5372, 4566, 3533, 3888, 3683, 5452,
       5575, 5496, 4864, 3985, 2695, 4196, 5169, 4891, 4915, 2435, 2674,
       2855, 4787, 2620, 2878, 4820, 3774, 2603,  725, 1941, 2272, 3003,
       2643, 2865,  993, 1336, 2935, 3852, 2115, 3336, 1302, 1407, 1090,
       1171, 1671, 2456, 2383, 1130, 1241, 2570, 2605, 2904, 1322, 1792,
        542, 1124, 2119, 2072, 1996, 2130, 1835,  473, 1141, 2293, 1655,
       1974, 1767, 1735,  872, 1541, 2540, 2526, 2366, 2224, 2007,  493,
        852, 1881, 2052, 1921, 1935, 1065, 1173,  743, 1579, 1574, 1726,
       1027,  810,  671,  747, 1092, 1377,  606, 1108,  594,  501,  669,
        570,  219,  194,  106,  130,  271,  308,  296,  214,  133,  135,
        207,   74,   34,   40,   66,   61,   89,   76,   53])

# Ver los tipos de datos almacenados en columnas seleccionadas
d["Berri1"].dtype, d["Date"].dtype, d["Unnamed: 1"].dtype, d['Brébeuf'].dtype

(dtype('int64'), dtype('O'), dtype('O'), dtype('float64'))

# Extraer dos columnas, y visualizar los primeros 10 elementos
d[["Berri1", "Maisonneuve_1"]].head(10)

	Berri1	Maisonneuve_1
0	29	20
1	19	3
2	24	12
3	24	8
4	120	111
5	261	146
6	60	33
7	24	14
8	35	20
9	81	45

# Ver el conteo de elementos con count(), importante para ver si hay columnas con NaN (caso de la columna Brebeuf)
d.count()

Date             365
Unnamed: 1       365
Berri1           365
Maisonneuve_1    365
Maisonneuve_2    365
Brébeuf          178
dtype: int64

Fijación de datos#

# Convertir la columna Date, como los índices del dataframe
d.index = pd.to_datetime(d.Date)

# Eliminar las columnas "Date" y "Unnamed: 1"
del(d["Date"])
del(d["Unnamed: 1"])

# Mostrar los primeros elementos, ya modificados
d.head()

<ipython-input-15-0678c59164e4>:2: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
  d.index = pd.to_datetime(d.Date)

	Berri1	Maisonneuve_1	Maisonneuve_2	Brébeuf
Date
2009-01-01	29	20	35	NaN
2009-02-01	19	3	22	NaN
2009-03-01	24	12	22	NaN
2009-04-01	24	8	15	NaN
2009-05-01	120	111	141	NaN

# Ver los nuevos indices
d.index

DatetimeIndex(['2009-01-01', '2009-02-01', '2009-03-01', '2009-04-01',
               '2009-05-01', '2009-06-01', '2009-07-01', '2009-08-01',
               '2009-09-01', '2009-10-01',
               ...
               '2009-12-22', '2009-12-23', '2009-12-24', '2009-12-25',
               '2009-12-26', '2009-12-27', '2009-12-28', '2009-12-29',
               '2009-12-30', '2009-12-31'],
              dtype='datetime64[ns]', name='Date', length=365, freq=None)

# Renombrar todas las columnas
d.columns=["Berri", "Mneuve1", "Mneuve2", "Brebeuf"]
d.head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	NaN
2009-02-01	19	3	22	NaN
2009-03-01	24	12	22	NaN
2009-04-01	24	8	15	NaN
2009-05-01	120	111	141	NaN

# Ver el total de valores nulos NaN para todas las columnas
for col in d.columns:
    print(col, np.sum(pd.isnull(d[col])))

Berri 0
Mneuve1 0
Mneuve2 0
Brebeuf 187

# Ver los nombres nuevos de las columnas
d.columns

Index(['Berri', 'Mneuve1', 'Mneuve2', 'Brebeuf'], dtype='object')

# Rellenar los valores de las columnas con la media
d.Brebeuf.fillna(d.Brebeuf.mean(), inplace=True)

# Reorganizar el dataframe, organizandolo por orden de fechas
d.sort_index(inplace=True)
d.head(20)

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	2576.359551
2009-01-02	14	2	2	2576.359551
2009-01-03	67	30	80	2576.359551
2009-01-04	0	0	0	2576.359551
2009-01-05	1925	1256	1501	2576.359551
2009-01-06	3274	2093	2726	2576.359551
2009-01-07	4471	2239	3051	2576.359551
2009-01-08	63	1053	3351	5869.000000
2009-01-09	5298	2796	5765	6939.000000
2009-01-10	2643	22	3528	1588.000000
2009-01-11	1141	540	1265	859.000000
2009-01-12	1092	599	1473	0.000000
2009-01-13	168	94	172	2576.359551
2009-01-14	145	87	175	2576.359551
2009-01-15	131	90	172	2576.359551
2009-01-16	93	49	97	2576.359551
2009-01-17	25	13	20	2576.359551
2009-01-18	52	22	55	2576.359551
2009-01-19	136	127	202	2576.359551
2009-01-20	147	85	176	2576.359551

# Visualizar la cantidad de elementos, ahora que ya se rellenó con la media, se tienen todos los 365 valores
d.count()

Berri      365
Mneuve1    365
Mneuve2    365
Brebeuf    365
dtype: int64

Filtrados de datos#

# Obtener las muestras del dataframe, cuando los valores de la columna Berri sean mayores a 6000
d[d.Berri>6000]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-05-06	6028	4120	4223	2576.359551
2009-06-17	6320	3388	6047	2576.359551
2009-07-15	6100	3767	5536	6939.000000
2009-09-07	6626	4227	5751	7575.000000
2009-10-07	6274	4242	5435	7268.000000

# Obtener las muestras del dataframe, cuando se complen dos condiciones, se utiliza &
d[(d.Berri>6000) & (d.Brebeuf<7000)]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-05-06	6028	4120	4223	2576.359551
2009-06-17	6320	3388	6047	2576.359551
2009-07-15	6100	3767	5536	6939.000000

Indexación y localización#

# Obtener datos para cuando Berri sea mayor a 5500 y organizarlos por fecha (indice)
d[d.Berri>5500].sort_index()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-03-08	5904	3102	4853	7194.000000
2009-05-06	6028	4120	4223	2576.359551
2009-05-08	5611	2646	5201	7121.000000
2009-05-21	5728	3693	5397	2576.359551
2009-06-16	5668	3499	5609	2576.359551
2009-06-17	6320	3388	6047	2576.359551
2009-06-23	5799	3114	5386	2576.359551
2009-07-15	6100	3767	5536	6939.000000
2009-07-20	5607	3825	5092	7064.000000
2009-07-21	5754	3745	5357	6996.000000
2009-07-28	5891	3292	5437	7219.000000
2009-09-07	6626	4227	5751	7575.000000
2009-09-09	5575	2727	6535	6686.000000
2009-10-07	6274	4242	5435	7268.000000
2009-12-08	5538	2368	5107	7127.000000

# Localizar sólo un conjunto de datos
d.iloc[100:110]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-04-11	1974	1113	2693	1046.000000
2009-04-12	1108	595	1472	0.000000
2009-04-13	0	0	0	2576.359551
2009-04-14	0	0	0	2576.359551
2009-04-15	0	0	0	2576.359551
2009-04-16	0	0	0	2576.359551
2009-04-17	1286	820	1436	2576.359551
2009-04-18	1178	667	826	2576.359551
2009-04-19	2131	1155	1426	2576.359551
2009-04-20	2709	1697	2646	2576.359551

# Seleccionar datos para un rango de fechas
d.loc["2009-10-01":"2009-10-10"]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-10-01	81	45	79	2576.359551
2009-10-02	228	101	260	2576.359551
2009-10-03	366	203	354	2576.359551
2009-10-04	0	0	0	2576.359551
2009-10-05	728	362	523	2576.359551
2009-10-06	3460	2354	3978	2576.359551
2009-10-07	6274	4242	5435	7268.000000
2009-10-08	2999	1545	3185	4187.000000
2009-10-09	5496	2921	6587	6520.000000
2009-10-10	1407	725	1443	1003.000000

# Organizar los datos según la columna Berri de menor a mayor
d.sort_values(by="Berri").head(100)

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-07-04	0	0	0	2576.359551
2009-03-30	0	0	0	2576.359551
2009-04-04	0	0	0	2576.359551
2009-04-13	0	0	0	2576.359551
2009-04-14	0	0	0	2576.359551
...	...	...	...	...
2009-12-22	207	107	353	0.000000
2009-01-27	209	119	221	2576.359551
2009-02-13	209	127	222	2576.359551
2009-12-18	214	119	373	0.000000
2009-10-12	219	110	363	0.000000

100 rows × 4 columns

# Organizar los datos según la columna Berri de mayor a menor (parámetro ascending=False), y localidar sólo en rango fechas
d.sort_values(by="Berri",ascending=False).loc["2009-10-01":"2009-10-10"]

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-10-07	6274	4242	5435	7268.000000
2009-10-09	5496	2921	6587	6520.000000
2009-10-06	3460	2354	3978	2576.359551
2009-10-08	2999	1545	3185	4187.000000
2009-10-10	1407	725	1443	1003.000000
2009-10-05	728	362	523	2576.359551
2009-10-03	366	203	354	2576.359551
2009-10-02	228	101	260	2576.359551
2009-10-01	81	45	79	2576.359551
2009-10-04	0	0	0	2576.359551

Series de tiempo#

d.head(10)

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	2576.359551
2009-01-02	14	2	2	2576.359551
2009-01-03	67	30	80	2576.359551
2009-01-04	0	0	0	2576.359551
2009-01-05	1925	1256	1501	2576.359551
2009-01-06	3274	2093	2726	2576.359551
2009-01-07	4471	2239	3051	2576.359551
2009-01-08	63	1053	3351	5869.000000
2009-01-09	5298	2796	5765	6939.000000
2009-01-10	2643	22	3528	1588.000000

# El metodo rolling nos proporciona una ventana móvil para operaciones matemáticas. En este caso la media por los 10 primeros meses
d.rolling(2).mean().head(10)

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	NaN	NaN	NaN	NaN
2009-01-02	21.5	11.0	18.5	2576.359551
2009-01-03	40.5	16.0	41.0	2576.359551
2009-01-04	33.5	15.0	40.0	2576.359551
2009-01-05	962.5	628.0	750.5	2576.359551
2009-01-06	2599.5	1674.5	2113.5	2576.359551
2009-01-07	3872.5	2166.0	2888.5	2576.359551
2009-01-08	2267.0	1646.0	3201.0	4222.679775
2009-01-09	2680.5	1924.5	4558.0	6404.000000
2009-01-10	3970.5	1409.0	4646.5	4263.500000

# Asigna un valor una marca temporal al index
d.index = d.index + pd.Timedelta("5m")
d.head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01 00:05:00	29	20	35	2576.359551
2009-01-02 00:05:00	14	2	2	2576.359551
2009-01-03 00:05:00	67	30	80	2576.359551
2009-01-04 00:05:00	0	0	0	2576.359551
2009-01-05 00:05:00	1925	1256	1501	2576.359551

#Visualizamos el mismo rango de tiempo pero esta vez con el metodo .shift observamos este rango un año despues
d.shift(freq=pd.Timedelta(days=365)).head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2010-01-01 00:05:00	29	20	35	2576.359551
2010-01-02 00:05:00	14	2	2	2576.359551
2010-01-03 00:05:00	67	30	80	2576.359551
2010-01-04 00:05:00	0	0	0	2576.359551
2010-01-05 00:05:00	1925	1256	1501	2576.359551

DownSampling#

La reducción de muestreo, implica reducir el número de instancias en la clase mayoritaria. Esto se hace eliminando aleatoriamente las instancias de la clase mayoritaria hasta que la distribución de la clase esté más equilibrada.

#Visualizamos el primer mes con saltos temporales de 2 días y los primeros valores de esa fecha
d.resample(pd.Timedelta("2d")).first().head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	29	20	35	2576.359551
2009-01-03	67	30	80	2576.359551
2009-01-05	1925	1256	1501	2576.359551
2009-01-07	4471	2239	3051	2576.359551
2009-01-09	5298	2796	5765	6939.000000

# Por otro lado utilizando .mean() podemos observar el valor promedio obtenido entre los datos pero con la particularidad de que estos han sido realizados con el valor del día siguiente
d.resample(pd.Timedelta("2d")).mean().head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01	21.5	11.0	18.5	2576.359551
2009-01-03	33.5	15.0	40.0	2576.359551
2009-01-05	2599.5	1674.5	2113.5	2576.359551
2009-01-07	2267.0	1646.0	3201.0	4222.679775
2009-01-09	3970.5	1409.0	4646.5	4263.500000

UpSampling#

El sobremuestreo implica aumentar el número de instancias de la clase minoritaria para equilibrar el conjunto de datos. Por lo general, esto se hace replicando aleatoriamente instancias de la clase minoritaria o generando muestras sintéticas utilizando técnicas como SMOTE (Técnica de sobremuestreo de minorías sintéticas). o como en este caso usando en metodo pad de .fillna que permite rellenar los valores faltantes tipo Nan

#Usamos resample para generar un intervalo de 12 horas en las misma fecha, duplicando así la cantidad de datos, sin embargo encontramos que estan vacios
d.resample(pd.Timedelta("12h")).first().head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01 00:00:00	29.0	20.0	35.0	2576.359551
2009-01-01 12:00:00	NaN	NaN	NaN	NaN
2009-01-02 00:00:00	14.0	2.0	2.0	2576.359551
2009-01-02 12:00:00	NaN	NaN	NaN	NaN
2009-01-03 00:00:00	67.0	30.0	80.0	2576.359551

#Para generar el duplicado de los datos con el intervalo de tiempo, sin que haya una ausencia de los datos, realizamos un rellenado con de estas con el metodo fillna
#fillna cuenta con un parametro llamado method que nos permite establecer el metodo de llenado, en este caso usamos 'pad' de Padding, se rellenan los valores faltantes utilizando el último valor válido conocido
d.resample(pd.Timedelta("12h")).first().fillna(method="pad").head()

	Berri	Mneuve1	Mneuve2	Brebeuf
Date
2009-01-01 00:00:00	29.0	20.0	35.0	2576.359551
2009-01-01 12:00:00	29.0	20.0	35.0	2576.359551
2009-01-02 00:00:00	14.0	2.0	2.0	2576.359551
2009-01-02 12:00:00	14.0	2.0	2.0	2576.359551
2009-01-03 00:00:00	67.0	30.0	80.0	2576.359551

Creación de DataFrames desde Arrays#

# Creación de un array de tamaño 20 filas, 5 columnas, tal que tiene valores de 0 a 9 aleatorios enteros
a = np.random.randint(10,size=(20,5))
a

array([[0, 6, 7, 9, 0],
       [8, 9, 1, 9, 5],
       [3, 6, 5, 7, 8],
       [2, 4, 9, 3, 7],
       [6, 7, 9, 5, 5],
       [7, 4, 9, 0, 9],
       [6, 6, 1, 7, 1],
       [1, 2, 7, 9, 0],
       [4, 0, 8, 0, 9],
       [4, 2, 2, 9, 5],
       [4, 6, 7, 6, 7],
       [1, 4, 0, 5, 5],
       [7, 4, 0, 1, 0],
       [0, 2, 9, 5, 9],
       [6, 4, 5, 2, 2],
       [5, 7, 0, 2, 0],
       [8, 6, 7, 5, 1],
       [5, 0, 4, 0, 0],
       [5, 3, 7, 5, 0],
       [9, 2, 1, 1, 3]])

# Conversión del array "a"  a un DataFrame "k", se le indican los nombres de las columnas, y de los indices
k = pd.DataFrame(a, columns=["uno", "dos", "tres", "cuatro", "cinco"], index=list(range(10,10+len(a))))
k

	uno	dos	tres	cuatro	cinco
10	0	6	7	9	0
11	8	9	1	9	5
12	3	6	5	7	8
13	2	4	9	3	7
14	6	7	9	5	5
15	7	4	9	0	9
16	6	6	1	7	1
17	1	2	7	9	0
18	4	0	8	0	9
19	4	2	2	9	5
20	4	6	7	6	7
21	1	4	0	5	5
22	7	4	0	1	0
23	0	2	9	5	9
24	6	4	5	2	2
25	5	7	0	2	0
26	8	6	7	5	1
27	5	0	4	0	0
28	5	3	7	5	0
29	9	2	1	1	3

# Conversión del DataFrame "k" a un array "x"
x=pd.DataFrame.to_numpy(k)

# Visualización del array
x

array([[0, 6, 7, 9, 0],
       [8, 9, 1, 9, 5],
       [3, 6, 5, 7, 8],
       [2, 4, 9, 3, 7],
       [6, 7, 9, 5, 5],
       [7, 4, 9, 0, 9],
       [6, 6, 1, 7, 1],
       [1, 2, 7, 9, 0],
       [4, 0, 8, 0, 9],
       [4, 2, 2, 9, 5],
       [4, 6, 7, 6, 7],
       [1, 4, 0, 5, 5],
       [7, 4, 0, 1, 0],
       [0, 2, 9, 5, 9],
       [6, 4, 5, 2, 2],
       [5, 7, 0, 2, 0],
       [8, 6, 7, 5, 1],
       [5, 0, 4, 0, 0],
       [5, 3, 7, 5, 0],
       [9, 2, 1, 1, 3]])

Graficando#

# Graficar usando plot, se grafican las variables numéricas, donde el eje X corresponde al índice con las fechas
d.plot(figsize=(15,3))

<AxesSubplot:xlabel='Date'>

../../_images/7978dcf0c2bfedaa4c7d0aa4dd93ad250d86201a38b3235a21acbeadccb15a1e.png

# Graficas sólo una columna (Berri)
d.Berri.plot()

<AxesSubplot:xlabel='Date'>

../../_images/9fb2e510abf61f2ed4f0d2dab4bb6d4d50bcc9d6f7aef06a8b1cbff5271689d9.png

# Graficas la columna Berri de forma suma acumulada
d.Berri.cumsum().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f894e1c47d0>

../../_images/fa18cdf74fd55879d0937724a83b81363d76e1209c61a4bdb9a4567ac5aa8102.png

# Graficar puntos para datos de dos columnas
plt.scatter(d.Berri, d.Brebeuf)

<matplotlib.collections.PathCollection at 0x7f484cf3f7f0>

../../_images/de49c8949426af147c45f279f0f9d0b5fdd1c3fc60fa9b9de1276a76b80f32ec.png

# Graficar puntos todas las columnas contra todas, la diagonal principal corresponde al histograma
pd.plotting.scatter_matrix(d, figsize=(15,15))

array([[<AxesSubplot:xlabel='Berri', ylabel='Berri'>,
        <AxesSubplot:xlabel='Mneuve1', ylabel='Berri'>,
        <AxesSubplot:xlabel='Mneuve2', ylabel='Berri'>,
        <AxesSubplot:xlabel='Brebeuf', ylabel='Berri'>],
       [<AxesSubplot:xlabel='Berri', ylabel='Mneuve1'>,
        <AxesSubplot:xlabel='Mneuve1', ylabel='Mneuve1'>,
        <AxesSubplot:xlabel='Mneuve2', ylabel='Mneuve1'>,
        <AxesSubplot:xlabel='Brebeuf', ylabel='Mneuve1'>],
       [<AxesSubplot:xlabel='Berri', ylabel='Mneuve2'>,
        <AxesSubplot:xlabel='Mneuve1', ylabel='Mneuve2'>,
        <AxesSubplot:xlabel='Mneuve2', ylabel='Mneuve2'>,
        <AxesSubplot:xlabel='Brebeuf', ylabel='Mneuve2'>],
       [<AxesSubplot:xlabel='Berri', ylabel='Brebeuf'>,
        <AxesSubplot:xlabel='Mneuve1', ylabel='Brebeuf'>,
        <AxesSubplot:xlabel='Mneuve2', ylabel='Brebeuf'>,
        <AxesSubplot:xlabel='Brebeuf', ylabel='Brebeuf'>]], dtype=object)

../../_images/cf3a8e5d4040b9e7d667e17da4d7b10bf50d3105fb91da85a3f30b7b2ab0f16a.png

Agrupamiento#

# Dado que los índices corresponden a fechas, obtener solamente el mes de la fecha
for i in d.index:
    print(i.month)

# Crear nueva columna con el mes, que se extrae de los indices
d["month"] = [i.month for i in d.index]
d.head(100)

	Berri	Mneuve1	Mneuve2	Brebeuf	month
Date
2009-01-01	29	20	35	2576.359551	1
2009-01-02	14	2	2	2576.359551	1
2009-01-03	67	30	80	2576.359551	1
2009-01-04	0	0	0	2576.359551	1
2009-01-05	1925	1256	1501	2576.359551	1
...	...	...	...	...	...
2009-04-06	5278	3499	4795	2576.359551	4
2009-04-07	2544	1653	2325	2576.359551	4
2009-04-08	4417	2382	4350	5384.000000	4
2009-04-09	4566	2393	5327	5837.000000	4
2009-04-10	1336	0	1620	1116.000000	4

100 rows × 5 columns

# Agrupar por el mes (columna creada), los valores máximos
d.groupby("month").max()

	Berri	Mneuve1	Mneuve2	Brebeuf
month
1	5298	2796	5765	6939.0
2	5451	2868	5517	7052.0
3	5904	3523	5762	7194.0
4	5278	3499	5327	5837.0
5	6028	4120	5397	7121.0
6	6320	3499	6047	5259.0
7	6100	3825	5536	7219.0
8	5452	2865	6379	7044.0
9	6626	4227	6535	7575.0
10	6274	4242	6587	7268.0
11	4864	2648	5895	6044.0
12	5538	2983	5107	7127.0

# Agrupar por el mes (columna creada), el conteo de datos
d.groupby("month").count()

	Berri	Mneuve1	Mneuve2	Brebeuf
month
1	31	31	31	31
2	28	28	28	28
3	31	31	31	31
4	30	30	30	30
5	31	31	31	31
6	30	30	30	30
7	31	31	31	31
8	31	31	31	31
9	30	30	30	30
10	31	31	31	31
11	30	30	30	30
12	31	31	31	31

	uno	dos	tres	cuatro	cinco
10	0	6	7	9	0
11	8	9	1	9	5
12	3	6	5	7	8
13	2	4	9	3	7
14	6	7	9	5	5
15	7	4	9	0	9
16	6	6	1	7	1
17	1	2	7	9	0
18	4	0	8	0	9
19	4	2	2	9	5
20	4	6	7	6	7
21	1	4	0	5	5
22	7	4	0	1	0
23	0	2	9	5	9
24	6	4	5	2	2
25	5	7	0	2	0
26	8	6	7	5	1
27	5	0	4	0	0
28	5	3	7	5	0
29	9	2	1	1	3

	uno	dos	tres	cuatro	cinco
10	0	6	7	9	0
11	8	9	1	9	5
12	3	6	5	7	8
13	2	4	9	3	7
14	6	7	9	5	5
15	7	4	9	0	9
16	6	6	1	7	1
17	1	2	7	9	0
18	4	0	8	0	9
19	4	2	2	9	5
20	4	6	7	6	7
21	1	4	0	5	5
22	7	4	0	1	0
23	0	2	9	5	9
24	6	4	5	2	2
25	5	7	0	2	0
26	8	6	7	5	1
27	5	0	4	0	0
28	5	3	7	5	0
29	9	2	1	1	3

	uno	dos	tres	cuatro	cinco
10	0	6	7	9	0
11	8	9	1	9	5
12	3	6	5	7	8
13	2	4	9	3	7
14	6	7	9	5	5
15	7	4	9	0	9
16	6	6	1	7	1
17	1	2	7	9	0
18	4	0	8	0	9
19	4	2	2	9	5
20	4	6	7	6	7
21	1	4	0	5	5
22	7	4	0	1	0
23	0	2	9	5	9
24	6	4	5	2	2
25	5	7	0	2	0
26	8	6	7	5	1
27	5	0	4	0	0
28	5	3	7	5	0
29	9	2	1	1	3