Machine Learning with Big Data and Apache Spark

Data exploration and preparation


For processing Big Data with Data Science, Data Analytics, or Machine Learning, a methodology widely used in industry is CRISP-DM (Cross-Industry Standard Process for Data Mining), originally developed by an industry consortium and later promoted by IBM. The following diagram shows the sequence of steps in this methodology:

[Figure: CRISP-DM diagram]

The diagram identifies the following stages:

  1. Business understanding (objectives)

  2. Data exploration

  3. Data preparation

  4. Modeling

  5. Evaluation

  6. Deployment

This is an iterative process: it begins with business understanding and, once the results have been evaluated, loops back to start again.

This notebook is inspired by Chapter 2 of the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.

Problem description

You are the data scientist at a real estate company in the state of California, USA. For some time, the company has struggled to assign an accurate price to a property before listing it, which greatly increases the time an agent needs to close the sale. In most cases, the initial price has turned out to be above the actual market price, which makes the property unattractive to buyers; it is also possible that some properties were listed below the market price, reducing the profit of the company and its sales agents.

Given this situation, you have been assigned the task of predicting the median value of properties in the state of California (the median_house_value column) from a dataset of relevant features.

Setting up the Google Colaboratory environment

# Install Java 8 (Spark needs a JVM)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Next, download Apache Spark 3.5.1 built for Hadoop 3.
# Note: older releases are eventually moved to https://archive.apache.org/dist/spark/
!wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
# Now we only need to extract the archive.
!tar xf spark-3.5.1-bin-hadoop3.tgz
import os
# Point the environment at the Java and Spark installations
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.1-bin-hadoop3"
# Install the required Python packages
!pip install pyspark==3.5.1
!pip install findspark
--2024-07-31 23:30:08--  https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400446614 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.1-bin-hadoop3.tgz’

spark-3.5.1-bin-had 100%[===================>] 381.90M   230MB/s    in 1.7s    

2024-07-31 23:30:10 (230 MB/s) - ‘spark-3.5.1-bin-hadoop3.tgz’ saved [400446614/400446614]

Collecting pyspark==3.5.1
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark==3.5.1) (0.10.9.7)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488490 sha256=2bfdb48ab300adfb5d4748203931805cc9ac666f8b9ecd23ffc87c749d4e3b01
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
import os
import tarfile
import urllib.request
from pathlib import Path

def fetch_housing_data():
    """Download and extract the housing dataset unless the tarball already exists."""
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        # Extract housing.csv into datasets/housing/
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")

fetch_housing_data()
import datetime as dt
import findspark
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pyspark.ml as ml
from pyspark.sql import functions as fct
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType

findspark.init()

Create the Spark session and load the data

ss = (SparkSession
      .builder
      .appName("data_exploration_preparation")
      .getOrCreate())
path = "/content/datasets/housing/housing.csv"
housing_data = ss.read.csv(path, inferSchema=True, header=True)
housing_data.show()
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1627.0|         280.0|     565.0|     259.0|       3.8462|          342200.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|      919.0|         213.0|     413.0|     193.0|       4.0368|          269700.0|       NEAR BAY|
|  -122.25|   37.84|              52.0|     2535.0|         489.0|    1094.0|     514.0|       3.6591|          299200.0|       NEAR BAY|
|  -122.25|   37.84|              52.0|     3104.0|         687.0|    1157.0|     647.0|         3.12|          241400.0|       NEAR BAY|
|  -122.26|   37.84|              42.0|     2555.0|         665.0|    1206.0|     595.0|       2.0804|          226700.0|       NEAR BAY|
|  -122.25|   37.84|              52.0|     3549.0|         707.0|    1551.0|     714.0|       3.6912|          261100.0|       NEAR BAY|
|  -122.26|   37.85|              52.0|     2202.0|         434.0|     910.0|     402.0|       3.2031|          281500.0|       NEAR BAY|
|  -122.26|   37.85|              52.0|     3503.0|         752.0|    1504.0|     734.0|       3.2705|          241800.0|       NEAR BAY|
|  -122.26|   37.85|              52.0|     2491.0|         474.0|    1098.0|     468.0|        3.075|          213500.0|       NEAR BAY|
|  -122.26|   37.84|              52.0|      696.0|         191.0|     345.0|     174.0|       2.6736|          191300.0|       NEAR BAY|
|  -122.26|   37.85|              52.0|     2643.0|         626.0|    1212.0|     620.0|       1.9167|          159200.0|       NEAR BAY|
|  -122.26|   37.85|              50.0|     1120.0|         283.0|     697.0|     264.0|        2.125|          140000.0|       NEAR BAY|
|  -122.27|   37.85|              52.0|     1966.0|         347.0|     793.0|     331.0|        2.775|          152500.0|       NEAR BAY|
|  -122.27|   37.85|              52.0|     1228.0|         293.0|     648.0|     303.0|       2.1202|          155500.0|       NEAR BAY|
|  -122.26|   37.84|              50.0|     2239.0|         455.0|     990.0|     419.0|       1.9911|          158700.0|       NEAR BAY|
|  -122.27|   37.84|              52.0|     1503.0|         298.0|     690.0|     275.0|       2.6033|          162900.0|       NEAR BAY|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 20 rows

Splitting into training and test sets

train_size = 0.7 # Training set size: 70%
test_size = 0.3 # Test set size: 30%
# randomSplit samples rows at random, so the resulting sizes are approximate
housing_data_train, housing_data_test = housing_data.randomSplit([train_size, test_size], seed=42)
housing_data_pd = housing_data_train.toPandas() # Convert to a pandas DataFrame to build visualizations
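
randomSplit assigns each row to a split at random according to the normalized weights, so the resulting sizes are only approximately 70% and 30%. The following plain-Python sketch illustrates the idea on a list of row indices. It is a conceptual illustration under simplifying assumptions, not Spark's implementation; `random_split` is a hypothetical helper and the row count is an arbitrary illustrative size.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # Guard against floating-point drift in the last boundary
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # Uniform in [0, 1)
        for i, b in enumerate(bounds):
            if u < b:
                splits[i].append(row)
                break
    return splits

rows = list(range(10_000))  # Row indices only; the size is illustrative
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # Close to, but usually not exactly, 7000 / 3000
```

Every row lands in exactly one split, so the two sets never overlap and together cover the whole dataset.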

Data exploration

housing_data_train.printSchema() # Schema of the dataset
root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)
housing_data_train.show(10) # First 10 records of the dataset
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -124.35|   40.54|              52.0|     1820.0|         300.0|     806.0|     270.0|       3.0147|           94600.0|     NEAR OCEAN|
|   -124.3|    41.8|              19.0|     2672.0|         552.0|    1298.0|     478.0|       1.9797|           85800.0|     NEAR OCEAN|
|  -124.27|   40.69|              36.0|     2349.0|         528.0|    1194.0|     465.0|       2.5179|           79000.0|     NEAR OCEAN|
|  -124.26|   40.58|              52.0|     2217.0|         394.0|     907.0|     369.0|       2.3571|          111400.0|     NEAR OCEAN|
|  -124.25|   40.28|              32.0|     1430.0|         419.0|     434.0|     187.0|       1.9417|           76100.0|     NEAR OCEAN|
|  -124.23|   40.81|              52.0|     1112.0|         209.0|     544.0|     172.0|       3.3462|           50800.0|     NEAR OCEAN|
|  -124.21|   40.75|              32.0|     1218.0|         331.0|     620.0|     268.0|       1.6528|           58100.0|     NEAR OCEAN|
|  -124.21|   41.75|              20.0|     3810.0|         787.0|    1993.0|     721.0|       2.0074|           66900.0|     NEAR OCEAN|
|  -124.21|   41.77|              17.0|     3461.0|         722.0|    1947.0|     647.0|       2.5795|           68400.0|     NEAR OCEAN|
|  -124.19|   41.78|              15.0|     3140.0|         714.0|    1645.0|     640.0|       1.6654|           74600.0|     NEAR OCEAN|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 10 rows
(housing_data_train
 .describe() # Basic summary statistics
 .show())
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+---------------+
|summary|          longitude|          latitude|housing_median_age|      total_rooms|   total_bedrooms|        population|        households|     median_income|median_house_value|ocean_proximity|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+---------------+
|  count|              14509|             14509|             14509|            14509|            14368|             14509|             14509|             14509|             14509|          14509|
|   mean|-119.59274932800152|35.660388724240136|28.668137018402373|2615.102212419877|534.5634047884187|1415.5490385278104|496.75215383555036|3.8453705562065017|205580.80998001242|           NULL|
| stddev| 2.0101381627905033|2.1443786967346115|12.563856962511244|2142.945353240265|415.2139447434203|1081.0649459905821| 377.2009098367398|1.8912964263124747|115026.89848308219|           NULL|
|    min|            -124.35|             32.54|               1.0|              2.0|              1.0|               3.0|               1.0|            0.4999|           14999.0|      <1H OCEAN|
|    max|            -114.31|             41.95|              52.0|          39320.0|           6210.0|           16305.0|            5358.0|           15.0001|          500001.0|     NEAR OCEAN|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+---------------+
housing_data_train.count() # Number of records
14509
housing_data_pd.hist(bins=50, figsize=(12, 8)) # Compute and plot the histogram of each feature
plt.show()
[Figure: histogram of each numerical feature in the training set]

About the histograms:

  • Income is not expressed in dollars (USD); it is a floating-point value between ≃0.5 and ≃15. According to the team that collected the data, this value is expressed in tens of thousands of dollars; that is, 3 corresponds to an income of approximately $30,000 USD.

  • Property prices have an artificial cap of $500,000 USD, which can cause undesired behavior in the model: it could "learn" that no properties are worth more than that.

  • The distributions of most features are skewed and do not follow a Gaussian distribution, which can hurt the performance of some models.

  • The features are represented on very different scales and ranges.
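
The first two observations can be checked with a few lines of code. The sketch below converts median_income to dollars under the stated unit (tens of thousands of USD) and flags house values that sit at the cap. The helper names are hypothetical, and the cap constant is an assumption taken from the maximum of median_house_value reported by describe() (500001.0).

```python
INCOME_UNIT_USD = 10_000      # median_income is expressed in tens of thousands of USD
HOUSE_VALUE_CAP = 500_001.0   # Assumed cap: maximum median_house_value shown by describe()

def income_to_usd(median_income):
    """Convert a median_income value to approximate US dollars."""
    return round(median_income * INCOME_UNIT_USD, 2)

def is_capped(median_house_value):
    """Return True when the value sits at the artificial upper limit."""
    return median_house_value >= HOUSE_VALUE_CAP

print(income_to_usd(3))        # 30000
print(income_to_usd(8.3252))   # 83252.0
print(is_capped(452600.0))     # False
print(is_capped(500001.0))     # True
```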

Geospatial data

housing_data_pd.plot(kind="scatter", x="longitude", y="latitude", grid=True)
plt.show()
[Figure: scatter plot of latitude versus longitude]
housing_data_pd.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
plt.show()
[Figure: scatter plot of latitude versus longitude with alpha=0.2]
housing_data_pd.plot(kind="scatter", x="longitude", y="latitude", grid=True,
                     s=housing_data_pd["population"] / 100, label="population",
                     c="median_house_value", cmap="jet", colorbar=True,
                     legend=True, sharex=False, figsize=(10, 7))
plt.show()
[Figure: scatter plot of latitude versus longitude; marker size shows population, color shows median_house_value]
filename = "california.png"
homl3_root = "https://github.com/ageron/handson-ml3/raw/main/"
url = homl3_root + "images/end_to_end_project/" + filename
print("Downloading", filename)
urllib.request.urlretrieve(url, filename)

housing_renamed = housing_data_pd.rename(columns={"latitude": "Latitude", "longitude": "Longitude",
                                                  "population": "Population",
                                                  "median_house_value": "Median house value (ᴜsᴅ)"})

housing_renamed.plot(kind="scatter", x="Longitude", y="Latitude",
                     s=housing_renamed["Population"] / 100, label="Population",
                     c="Median house value (ᴜsᴅ)", cmap="jet", colorbar=True,
                     legend=True, sharex=False, figsize=(10, 7))

california_img = plt.imread(filename)
axis = -124.55, -113.95, 32.45, 42.05
plt.axis(axis)
plt.imshow(california_img, extent=axis)
plt.show()
Downloading california.png
[Figure: the same scatter plot overlaid on a map of California]

Correlations

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (that is, they change together at a constant rate). It is a common tool for describing simple relationships without making claims about cause and effect [ref].
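
For two variables x and y, the Pearson coefficient is their covariance divided by the product of their standard deviations, so it always lies between -1 and 1. This is the coefficient computed by PySpark's DataFrame.corr, which currently supports only the Pearson method. The sketch below is a plain-Python version for illustration; `pearson` is a hypothetical helper, and the sample values are the median_income and median_house_value of the first five training rows shown earlier.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# median_income and median_house_value of the first five training rows
income = [3.0147, 1.9797, 2.5179, 2.3571, 1.9417]
value = [94600.0, 85800.0, 79000.0, 111400.0, 76100.0]
print(pearson(income, value))

# A perfectly linear increasing relationship gives 1 (up to rounding)
print(pearson([1, 2, 3, 4], [10, 20, 30, 40]))
```

Five rows are far too few to estimate the relationship reliably; the sample only shows how the computation works.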

cols = housing_data_train.columns
target_col = "median_house_value"
cols.remove("ocean_proximity")
cols.remove(target_col)
print("Correlation between the variables: ")
for col in cols:
  print(f"{target_col} - {col}: {housing_data_train.corr(target_col, col)}")
Correlation between the variables: 
median_house_value - longitude: -0.048734489520877715
median_house_value - latitude: -0.14166014521476414
median_house_value - housing_median_age: 0.10585697542376424
median_house_value - total_rooms: 0.13932642838983347
median_house_value - total_bedrooms: 0.056133897510110844
median_house_value - population: -0.02148680786679524
median_house_value - households: 0.07041591413437558
median_house_value - median_income: 0.6836796788520056
caracteristicas = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
pd.plotting.scatter_matrix(housing_data_pd[caracteristicas], figsize=(12, 8))
plt.show()
[Figure: scatter matrix of median_house_value, median_income, total_rooms, and housing_median_age]

The variable most strongly correlated with our target feature (median_house_value) is median_income. In the plot of these two features we see a clear upward trend and relatively little scatter among the points.

housing_data_pd.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1, grid=True)
plt.show()
[Figure: scatter plot of median_house_value versus median_income]

Creating new attributes

housing_data_train2 = (housing_data_train
                      .withColumn("rooms_per_household", housing_data_train["total_rooms"]/housing_data_train["households"])
                      .withColumn("bedrooms_per_room", housing_data_train["total_bedrooms"]/housing_data_train["total_rooms"])
                      .withColumn("population_per_household", housing_data_train["population"]/housing_data_train["households"]))
new_cols = ["rooms_per_household", "bedrooms_per_room", "population_per_household"]
for col in new_cols:
  print(f"{target_col} - {col}: {housing_data_train2.corr(target_col, col)}")
median_house_value - rooms_per_household: 0.14836664719672774
median_house_value - bedrooms_per_room: -0.23387759463876323
median_house_value - population_per_household: -0.03004164488595351
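
Each new attribute is a simple ratio of two existing columns. As a worked example, the sketch below applies the same three ratios to the first training row shown earlier (total_rooms = 1820, total_bedrooms = 300, population = 806, households = 270). The `ratios` helper is hypothetical and only mirrors the arithmetic of the withColumn calls.

```python
def ratios(total_rooms, total_bedrooms, population, households):
    """Compute the three ratios used as new attributes."""
    return {
        "rooms_per_household": total_rooms / households,
        "bedrooms_per_room": total_bedrooms / total_rooms,
        "population_per_household": population / households,
    }

row = ratios(total_rooms=1820.0, total_bedrooms=300.0,
             population=806.0, households=270.0)
for name, value in row.items():
    print(f"{name}: {value:.3f}")
# rooms_per_household: 6.741
# bedrooms_per_room: 0.165
# population_per_household: 2.985
```

In Spark, a division involving a null operand yields null, so bedrooms_per_room is null for the rows where total_bedrooms is missing.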

Data preparation

Separating features/predictors and labels

X_train = housing_data_train.drop("median_house_value")
y_train = housing_data_train.select("median_house_value")

X_test = housing_data_test.drop("median_house_value")
y_test = housing_data_test.select("median_house_value")

Handling missing values

(X_train
 .describe()
 .show())
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+---------------+
|summary|          longitude|          latitude|housing_median_age|      total_rooms|   total_bedrooms|        population|        households|     median_income|ocean_proximity|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+---------------+
|  count|              14509|             14509|             14509|            14509|            14368|             14509|             14509|             14509|          14509|
|   mean|-119.59274932800152|35.660388724240136|28.668137018402373|2615.102212419877|534.5634047884187|1415.5490385278104|496.75215383555036|3.8453705562065017|           NULL|
| stddev| 2.0101381627905033|2.1443786967346115|12.563856962511244|2142.945353240265|415.2139447434203|1081.0649459905821| 377.2009098367398|1.8912964263124747|           NULL|
|    min|            -124.35|             32.54|               1.0|              2.0|              1.0|               3.0|               1.0|            0.4999|      <1H OCEAN|
|    max|            -114.31|             41.95|              52.0|          39320.0|           6210.0|           16305.0|            5358.0|           15.0001|     NEAR OCEAN|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+------------------+---------------+

Option 1: drop the records with missing values

X_train_valid = X_train.dropna(how = "any", subset = ["total_bedrooms"])
X_train_valid.count()
14368

Option 2: drop the attribute

X_train_ = X_train.drop("total_bedrooms")
print(X_train_.count())
print(X_train_.columns) # total_bedrooms is no longer present
14509
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'population', 'households', 'median_income', 'ocean_proximity']

Option 3: since it is a numerical feature, the missing values can be replaced with a summary statistic of that feature, such as the median.

imputer = (ml.feature.Imputer()
           .setStrategy("median")
           .setInputCols(["total_bedrooms"])
           .setOutputCols(["total_bedrooms_complete"]))
model = imputer.fit(X_train)
model.surrogateDF.show()
+--------------+
|total_bedrooms|
+--------------+
|         433.0|
+--------------+
X_train_median = model.transform(X_train)
X_train_median.count()
14509
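
The fit/transform pattern used by the imputer can be sketched in plain Python: the fill value is learned once from the training column and then reused on any other data, including the test set. The helpers and the toy columns below are hypothetical. Note also that, according to the PySpark documentation, Imputer computes the median approximately (via approxQuantile), so its value can differ slightly from the exact median used here.

```python
import statistics

def fit_median(values):
    """Learn the fill value: the median of the non-missing entries."""
    observed = [v for v in values if v is not None]
    return statistics.median(observed)

def impute(values, fill_value):
    """Replace missing entries (None) with the learned fill value."""
    return [fill_value if v is None else v for v in values]

# Toy columns, made up for illustration; None marks a missing value
train_bedrooms = [300.0, None, 528.0, 394.0, 419.0]
test_bedrooms = [None, 209.0]

fill = fit_median(train_bedrooms)      # Learned from the training data only
print(fill)                            # 406.5
print(impute(train_bedrooms, fill))    # [300.0, 406.5, 528.0, 394.0, 419.0]
print(impute(test_bedrooms, fill))     # [406.5, 209.0]
```

Learning the statistic only from the training set keeps information from the test set out of the preparation step.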

Categorical features

In general, machine learning models and algorithms can only be trained on numerical features. Text or categorical features must be converted to a numerical representation.

X_train.select("ocean_proximity").show(5)
+---------------+
|ocean_proximity|
+---------------+
|     NEAR OCEAN|
|     NEAR OCEAN|
|     NEAR OCEAN|
|     NEAR OCEAN|
|     NEAR OCEAN|
+---------------+
only showing top 5 rows
unique_ocean_prox = (X_train
                     .select("ocean_proximity")
                     .dropDuplicates()
                     .collect())
unique_ocean_prox = [item.ocean_proximity for item in unique_ocean_prox]
print(unique_ocean_prox)
['ISLAND', 'NEAR OCEAN', 'NEAR BAY', '<1H OCEAN', 'INLAND']

Option 1: replace the values with an integer

map_dict = {item: i for i, item in enumerate(unique_ocean_prox)}
print(map_dict)
{'ISLAND': 0, 'NEAR OCEAN': 1, 'NEAR BAY': 2, '<1H OCEAN': 3, 'INLAND': 4}
stringIndexer = ml.feature.StringIndexer(inputCol="ocean_proximity", outputCol="ordinal_ocean_proximity", stringOrderType="frequencyDesc")
stringIndexer.setHandleInvalid("error")
model = stringIndexer.fit(X_train)
X_train_ordinal = model.transform(X_train).drop("ocean_proximity")
X_train_ordinal.show()
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+-----------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|ordinal_ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+-----------------------+
|  -124.35|   40.54|              52.0|     1820.0|         300.0|     806.0|     270.0|       3.0147|                    2.0|
|   -124.3|    41.8|              19.0|     2672.0|         552.0|    1298.0|     478.0|       1.9797|                    2.0|
|  -124.27|   40.69|              36.0|     2349.0|         528.0|    1194.0|     465.0|       2.5179|                    2.0|
|  -124.26|   40.58|              52.0|     2217.0|         394.0|     907.0|     369.0|       2.3571|                    2.0|
|  -124.25|   40.28|              32.0|     1430.0|         419.0|     434.0|     187.0|       1.9417|                    2.0|
|  -124.23|   40.81|              52.0|     1112.0|         209.0|     544.0|     172.0|       3.3462|                    2.0|
|  -124.21|   40.75|              32.0|     1218.0|         331.0|     620.0|     268.0|       1.6528|                    2.0|
|  -124.21|   41.75|              20.0|     3810.0|         787.0|    1993.0|     721.0|       2.0074|                    2.0|
|  -124.21|   41.77|              17.0|     3461.0|         722.0|    1947.0|     647.0|       2.5795|                    2.0|
|  -124.19|   41.78|              15.0|     3140.0|         714.0|    1645.0|     640.0|       1.6654|                    2.0|
|  -124.18|   40.62|              35.0|      952.0|         178.0|     480.0|     179.0|       3.0536|                    2.0|
|  -124.18|   40.78|              33.0|     1076.0|         222.0|     656.0|     236.0|       2.5096|                    2.0|
|  -124.18|   40.78|              37.0|     1453.0|         293.0|     867.0|     310.0|       2.5536|                    2.0|
|  -124.18|   40.79|              40.0|     1398.0|         311.0|     788.0|     279.0|       1.4668|                    2.0|
|  -124.17|   40.75|              13.0|     2171.0|         339.0|     951.0|     353.0|       4.8516|                    2.0|
|  -124.17|   40.76|              26.0|     1776.0|         361.0|     992.0|     380.0|       2.8056|                    2.0|
|  -124.17|   40.77|              30.0|     1895.0|         366.0|     990.0|     359.0|       2.2227|                    2.0|
|  -124.17|    40.8|              52.0|     1557.0|         344.0|     758.0|     319.0|       1.8529|                    2.0|
|  -124.17|   41.76|              20.0|     2673.0|         538.0|    1282.0|     514.0|       2.4605|                    2.0|
|  -124.16|   40.77|              35.0|     2141.0|         438.0|    1053.0|     434.0|       2.8529|                    2.0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+-----------------------+
only showing top 20 rows
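
Note that the indices produced by StringIndexer differ from those in map_dict: with stringOrderType="frequencyDesc", the most frequent label receives index 0, the next one index 1, and so on. NEAR OCEAN appearing as 2.0 therefore means it is the third most frequent label in the training set. A minimal plain-Python sketch of that ordering follows; the label column is made up for illustration, ties are broken alphabetically, and `frequency_index` is a hypothetical helper rather than Spark's implementation.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Toy label column, made up for illustration
labels = ["INLAND", "<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"]
mapping = frequency_index(labels)
print(mapping)                               # {'<1H OCEAN': 0.0, 'INLAND': 1.0, 'NEAR OCEAN': 2.0}
print([mapping[label] for label in labels])  # [1.0, 0.0, 0.0, 2.0, 1.0, 0.0]
```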

Option 2: one-hot encoding

This second option avoids a problem with option 1: it implicitly assigns a hierarchy or order to our categorical variable, which in many cases does not reflect reality.

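@fct.udf(returnType=StringType())
def one_hot_encoding(x, map_dict = map_dict):
  # Start from a vector of zeros with one position per known category
  one_hot_vector = np.zeros(len(map_dict), dtype=np.uint8)
  idx = map_dict.get(x)
  if idx is not None:  # Unknown or null categories stay all zeros
    one_hot_vector[idx] = 1
  # Serialize as a comma-separated string, e.g. "0,1,0,0,0"
  return ','.join([str(i) for i in one_hot_vector])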
(X_train
 .withColumn("onehot_ocean_proximity", one_hot_encoding(X_train.ocean_proximity))
 .withColumn("onehot_ocean_ISLAND", fct.split("onehot_ocean_proximity", ",").getItem(0))
 .withColumn("onehot_ocean_NEAR_OCEAN", fct.split("onehot_ocean_proximity", ",").getItem(1))
 .withColumn("onehot_ocean_NEAR_BAY", fct.split("onehot_ocean_proximity", ",").getItem(2))
 .withColumn("onehot_ocean_1H_OCEAN", fct.split("onehot_ocean_proximity", ",").getItem(3))
 .withColumn("onehot_ocean_INLAND", fct.split("onehot_ocean_proximity", ",").getItem(4))
 #.where("ocean_proximity = 'NEAR BAY'")
 .show())
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+----------------------+-------------------+-----------------------+---------------------+---------------------+-------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|ocean_proximity|onehot_ocean_proximity|onehot_ocean_ISLAND|onehot_ocean_NEAR_OCEAN|onehot_ocean_NEAR_BAY|onehot_ocean_1H_OCEAN|onehot_ocean_INLAND|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+----------------------+-------------------+-----------------------+---------------------+---------------------+-------------------+
|  -124.35|   40.54|              52.0|     1820.0|         300.0|     806.0|     270.0|       3.0147|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|   -124.3|    41.8|              19.0|     2672.0|         552.0|    1298.0|     478.0|       1.9797|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.27|   40.69|              36.0|     2349.0|         528.0|    1194.0|     465.0|       2.5179|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.26|   40.58|              52.0|     2217.0|         394.0|     907.0|     369.0|       2.3571|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.25|   40.28|              32.0|     1430.0|         419.0|     434.0|     187.0|       1.9417|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.23|   40.81|              52.0|     1112.0|         209.0|     544.0|     172.0|       3.3462|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.21|   40.75|              32.0|     1218.0|         331.0|     620.0|     268.0|       1.6528|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.21|   41.75|              20.0|     3810.0|         787.0|    1993.0|     721.0|       2.0074|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.21|   41.77|              17.0|     3461.0|         722.0|    1947.0|     647.0|       2.5795|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.19|   41.78|              15.0|     3140.0|         714.0|    1645.0|     640.0|       1.6654|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.62|              35.0|      952.0|         178.0|     480.0|     179.0|       3.0536|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.78|              33.0|     1076.0|         222.0|     656.0|     236.0|       2.5096|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.78|              37.0|     1453.0|         293.0|     867.0|     310.0|       2.5536|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.79|              40.0|     1398.0|         311.0|     788.0|     279.0|       1.4668|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   40.75|              13.0|     2171.0|         339.0|     951.0|     353.0|       4.8516|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   40.76|              26.0|     1776.0|         361.0|     992.0|     380.0|       2.8056|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   40.77|              30.0|     1895.0|         366.0|     990.0|     359.0|       2.2227|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.17|    40.8|              52.0|     1557.0|         344.0|     758.0|     319.0|       1.8529|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   41.76|              20.0|     2673.0|         538.0|    1282.0|     514.0|       2.4605|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
|  -124.16|   40.77|              35.0|     2141.0|         438.0|    1053.0|     434.0|       2.8529|     NEAR OCEAN|             0,1,0,0,0|                  0|                      1|                    0|                    0|                  0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+----------------------+-------------------+-----------------------+---------------------+---------------------+-------------------+
only showing top 20 rows
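# One-hot encode the indexed category column. OneHotEncoder drops the last
# category by default, so the 5 categories become sparse vectors of size 4.
onehotencoder = ml.feature.OneHotEncoder(inputCol="ordinal_ocean_proximity", outputCol="onehot_ocean_proximity")
onehotencoder.setHandleInvalid("error")  # raise an error on category indices not seen during fit
model = onehotencoder.fit(X_train_ordinal)
X_train_onehot = model.transform(X_train_ordinal).drop("ordinal_ocean_proximity")
X_train_onehot.show()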
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+----------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|onehot_ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+----------------------+
|  -124.35|   40.54|              52.0|     1820.0|         300.0|     806.0|     270.0|       3.0147|         (4,[2],[1.0])|
|   -124.3|    41.8|              19.0|     2672.0|         552.0|    1298.0|     478.0|       1.9797|         (4,[2],[1.0])|
|  -124.27|   40.69|              36.0|     2349.0|         528.0|    1194.0|     465.0|       2.5179|         (4,[2],[1.0])|
|  -124.26|   40.58|              52.0|     2217.0|         394.0|     907.0|     369.0|       2.3571|         (4,[2],[1.0])|
|  -124.25|   40.28|              32.0|     1430.0|         419.0|     434.0|     187.0|       1.9417|         (4,[2],[1.0])|
|  -124.23|   40.81|              52.0|     1112.0|         209.0|     544.0|     172.0|       3.3462|         (4,[2],[1.0])|
|  -124.21|   40.75|              32.0|     1218.0|         331.0|     620.0|     268.0|       1.6528|         (4,[2],[1.0])|
|  -124.21|   41.75|              20.0|     3810.0|         787.0|    1993.0|     721.0|       2.0074|         (4,[2],[1.0])|
|  -124.21|   41.77|              17.0|     3461.0|         722.0|    1947.0|     647.0|       2.5795|         (4,[2],[1.0])|
|  -124.19|   41.78|              15.0|     3140.0|         714.0|    1645.0|     640.0|       1.6654|         (4,[2],[1.0])|
|  -124.18|   40.62|              35.0|      952.0|         178.0|     480.0|     179.0|       3.0536|         (4,[2],[1.0])|
|  -124.18|   40.78|              33.0|     1076.0|         222.0|     656.0|     236.0|       2.5096|         (4,[2],[1.0])|
|  -124.18|   40.78|              37.0|     1453.0|         293.0|     867.0|     310.0|       2.5536|         (4,[2],[1.0])|
|  -124.18|   40.79|              40.0|     1398.0|         311.0|     788.0|     279.0|       1.4668|         (4,[2],[1.0])|
|  -124.17|   40.75|              13.0|     2171.0|         339.0|     951.0|     353.0|       4.8516|         (4,[2],[1.0])|
|  -124.17|   40.76|              26.0|     1776.0|         361.0|     992.0|     380.0|       2.8056|         (4,[2],[1.0])|
|  -124.17|   40.77|              30.0|     1895.0|         366.0|     990.0|     359.0|       2.2227|         (4,[2],[1.0])|
|  -124.17|    40.8|              52.0|     1557.0|         344.0|     758.0|     319.0|       1.8529|         (4,[2],[1.0])|
|  -124.17|   41.76|              20.0|     2673.0|         538.0|    1282.0|     514.0|       2.4605|         (4,[2],[1.0])|
|  -124.16|   40.77|              35.0|     2141.0|         438.0|    1053.0|     434.0|       2.8529|         (4,[2],[1.0])|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+----------------------+
only showing top 20 rows
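
The `onehot_ocean_proximity` column above is printed in Spark's sparse-vector notation, `(size, [indices], [values])`. As a rough illustration in plain Python, `(4,[2],[1.0])` can be expanded into a dense list as follows. The helper `sparse_to_dense` is hypothetical and is not part of Spark:

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The value shown for every NEAR OCEAN row in the table above
print(sparse_to_dense(4, [2], [1.0]))  # [0.0, 0.0, 1.0, 0.0]
```

With the default `dropLast=True`, the last category index has no position of its own, so it is encoded as the all-zero vector `(4,[],[])`.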
# Alternative: encode manually with the one_hot_encoding function defined earlier,
# which produces a comma-separated string such as "0,1,0,0,0".
# Each position of that string is then split out into its own integer column.
X_train_onehot = (X_train
                  .withColumn("onehot_ocean_proximity", one_hot_encoding(X_train.ocean_proximity))
                  .withColumn("onehot_ocean_ISLAND", fct.split("onehot_ocean_proximity", ",").getItem(0).cast(IntegerType()))
                  .withColumn("onehot_ocean_NEAR_OCEAN", fct.split("onehot_ocean_proximity", ",").getItem(1).cast(IntegerType()))
                  .withColumn("onehot_ocean_NEAR_BAY", fct.split("onehot_ocean_proximity", ",").getItem(2).cast(IntegerType()))
                  .withColumn("onehot_ocean_1H_OCEAN", fct.split("onehot_ocean_proximity", ",").getItem(3).cast(IntegerType()))
                  .withColumn("onehot_ocean_INLAND", fct.split("onehot_ocean_proximity", ",").getItem(4).cast(IntegerType()))
                  .drop("ocean_proximity")
                  .drop("onehot_ocean_proximity"))
X_train_onehot.show()
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+-------------------+-----------------------+---------------------+---------------------+-------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|onehot_ocean_ISLAND|onehot_ocean_NEAR_OCEAN|onehot_ocean_NEAR_BAY|onehot_ocean_1H_OCEAN|onehot_ocean_INLAND|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+-------------------+-----------------------+---------------------+---------------------+-------------------+
|  -124.35|   40.54|              52.0|     1820.0|         300.0|     806.0|     270.0|       3.0147|                  0|                      1|                    0|                    0|                  0|
|   -124.3|    41.8|              19.0|     2672.0|         552.0|    1298.0|     478.0|       1.9797|                  0|                      1|                    0|                    0|                  0|
|  -124.27|   40.69|              36.0|     2349.0|         528.0|    1194.0|     465.0|       2.5179|                  0|                      1|                    0|                    0|                  0|
|  -124.26|   40.58|              52.0|     2217.0|         394.0|     907.0|     369.0|       2.3571|                  0|                      1|                    0|                    0|                  0|
|  -124.25|   40.28|              32.0|     1430.0|         419.0|     434.0|     187.0|       1.9417|                  0|                      1|                    0|                    0|                  0|
|  -124.23|   40.81|              52.0|     1112.0|         209.0|     544.0|     172.0|       3.3462|                  0|                      1|                    0|                    0|                  0|
|  -124.21|   40.75|              32.0|     1218.0|         331.0|     620.0|     268.0|       1.6528|                  0|                      1|                    0|                    0|                  0|
|  -124.21|   41.75|              20.0|     3810.0|         787.0|    1993.0|     721.0|       2.0074|                  0|                      1|                    0|                    0|                  0|
|  -124.21|   41.77|              17.0|     3461.0|         722.0|    1947.0|     647.0|       2.5795|                  0|                      1|                    0|                    0|                  0|
|  -124.19|   41.78|              15.0|     3140.0|         714.0|    1645.0|     640.0|       1.6654|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.62|              35.0|      952.0|         178.0|     480.0|     179.0|       3.0536|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.78|              33.0|     1076.0|         222.0|     656.0|     236.0|       2.5096|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.78|              37.0|     1453.0|         293.0|     867.0|     310.0|       2.5536|                  0|                      1|                    0|                    0|                  0|
|  -124.18|   40.79|              40.0|     1398.0|         311.0|     788.0|     279.0|       1.4668|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   40.75|              13.0|     2171.0|         339.0|     951.0|     353.0|       4.8516|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   40.76|              26.0|     1776.0|         361.0|     992.0|     380.0|       2.8056|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   40.77|              30.0|     1895.0|         366.0|     990.0|     359.0|       2.2227|                  0|                      1|                    0|                    0|                  0|
|  -124.17|    40.8|              52.0|     1557.0|         344.0|     758.0|     319.0|       1.8529|                  0|                      1|                    0|                    0|                  0|
|  -124.17|   41.76|              20.0|     2673.0|         538.0|    1282.0|     514.0|       2.4605|                  0|                      1|                    0|                    0|                  0|
|  -124.16|   40.77|              35.0|     2141.0|         438.0|    1053.0|     434.0|       2.8529|                  0|                      1|                    0|                    0|                  0|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+-------------------+-----------------------+---------------------+---------------------+-------------------+
only showing top 20 rows

Normalization

X_train_minmax_scaled = X_train_valid.alias("X_train_minmax_scaled")
numeric_cols = ["longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income"]
vec_numeric_cols = ["vec_"+col for col in numeric_cols]
# MinMaxScaler only accepts vector columns, so each numeric column
# is first wrapped in a one-element vector with VectorAssembler.
vecsAssembler = []
for in_col, out_col in zip(numeric_cols, vec_numeric_cols):
  vecAssembler = ml.feature.VectorAssembler(outputCol=out_col, inputCols=[in_col])
  X_train_minmax_scaled = vecAssembler.transform(X_train_minmax_scaled)
  vecsAssembler.append(vecAssembler)
X_train_minmax_scaled.show()
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+-------------+------------+----------------------+---------------+------------------+--------------+--------------+-----------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|ocean_proximity|vec_longitude|vec_latitude|vec_housing_median_age|vec_total_rooms|vec_total_bedrooms|vec_population|vec_households|vec_median_income|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+-------------+------------+----------------------+---------------+------------------+--------------+--------------+-----------------+
|  -124.35|   40.54|              52.0|     1820.0|         300.0|     806.0|     270.0|       3.0147|     NEAR OCEAN|    [-124.35]|     [40.54]|                [52.0]|       [1820.0]|           [300.0]|       [806.0]|       [270.0]|         [3.0147]|
|   -124.3|    41.8|              19.0|     2672.0|         552.0|    1298.0|     478.0|       1.9797|     NEAR OCEAN|     [-124.3]|      [41.8]|                [19.0]|       [2672.0]|           [552.0]|      [1298.0]|       [478.0]|         [1.9797]|
|  -124.27|   40.69|              36.0|     2349.0|         528.0|    1194.0|     465.0|       2.5179|     NEAR OCEAN|    [-124.27]|     [40.69]|                [36.0]|       [2349.0]|           [528.0]|      [1194.0]|       [465.0]|         [2.5179]|
|  -124.26|   40.58|              52.0|     2217.0|         394.0|     907.0|     369.0|       2.3571|     NEAR OCEAN|    [-124.26]|     [40.58]|                [52.0]|       [2217.0]|           [394.0]|       [907.0]|       [369.0]|         [2.3571]|
|  -124.25|   40.28|              32.0|     1430.0|         419.0|     434.0|     187.0|       1.9417|     NEAR OCEAN|    [-124.25]|     [40.28]|                [32.0]|       [1430.0]|           [419.0]|       [434.0]|       [187.0]|         [1.9417]|
|  -124.23|   40.81|              52.0|     1112.0|         209.0|     544.0|     172.0|       3.3462|     NEAR OCEAN|    [-124.23]|     [40.81]|                [52.0]|       [1112.0]|           [209.0]|       [544.0]|       [172.0]|         [3.3462]|
|  -124.21|   40.75|              32.0|     1218.0|         331.0|     620.0|     268.0|       1.6528|     NEAR OCEAN|    [-124.21]|     [40.75]|                [32.0]|       [1218.0]|           [331.0]|       [620.0]|       [268.0]|         [1.6528]|
|  -124.21|   41.75|              20.0|     3810.0|         787.0|    1993.0|     721.0|       2.0074|     NEAR OCEAN|    [-124.21]|     [41.75]|                [20.0]|       [3810.0]|           [787.0]|      [1993.0]|       [721.0]|         [2.0074]|
|  -124.21|   41.77|              17.0|     3461.0|         722.0|    1947.0|     647.0|       2.5795|     NEAR OCEAN|    [-124.21]|     [41.77]|                [17.0]|       [3461.0]|           [722.0]|      [1947.0]|       [647.0]|         [2.5795]|
|  -124.19|   41.78|              15.0|     3140.0|         714.0|    1645.0|     640.0|       1.6654|     NEAR OCEAN|    [-124.19]|     [41.78]|                [15.0]|       [3140.0]|           [714.0]|      [1645.0]|       [640.0]|         [1.6654]|
|  -124.18|   40.62|              35.0|      952.0|         178.0|     480.0|     179.0|       3.0536|     NEAR OCEAN|    [-124.18]|     [40.62]|                [35.0]|        [952.0]|           [178.0]|       [480.0]|       [179.0]|         [3.0536]|
|  -124.18|   40.78|              33.0|     1076.0|         222.0|     656.0|     236.0|       2.5096|     NEAR OCEAN|    [-124.18]|     [40.78]|                [33.0]|       [1076.0]|           [222.0]|       [656.0]|       [236.0]|         [2.5096]|
|  -124.18|   40.78|              37.0|     1453.0|         293.0|     867.0|     310.0|       2.5536|     NEAR OCEAN|    [-124.18]|     [40.78]|                [37.0]|       [1453.0]|           [293.0]|       [867.0]|       [310.0]|         [2.5536]|
|  -124.18|   40.79|              40.0|     1398.0|         311.0|     788.0|     279.0|       1.4668|     NEAR OCEAN|    [-124.18]|     [40.79]|                [40.0]|       [1398.0]|           [311.0]|       [788.0]|       [279.0]|         [1.4668]|
|  -124.17|   40.75|              13.0|     2171.0|         339.0|     951.0|     353.0|       4.8516|     NEAR OCEAN|    [-124.17]|     [40.75]|                [13.0]|       [2171.0]|           [339.0]|       [951.0]|       [353.0]|         [4.8516]|
|  -124.17|   40.76|              26.0|     1776.0|         361.0|     992.0|     380.0|       2.8056|     NEAR OCEAN|    [-124.17]|     [40.76]|                [26.0]|       [1776.0]|           [361.0]|       [992.0]|       [380.0]|         [2.8056]|
|  -124.17|   40.77|              30.0|     1895.0|         366.0|     990.0|     359.0|       2.2227|     NEAR OCEAN|    [-124.17]|     [40.77]|                [30.0]|       [1895.0]|           [366.0]|       [990.0]|       [359.0]|         [2.2227]|
|  -124.17|    40.8|              52.0|     1557.0|         344.0|     758.0|     319.0|       1.8529|     NEAR OCEAN|    [-124.17]|      [40.8]|                [52.0]|       [1557.0]|           [344.0]|       [758.0]|       [319.0]|         [1.8529]|
|  -124.17|   41.76|              20.0|     2673.0|         538.0|    1282.0|     514.0|       2.4605|     NEAR OCEAN|    [-124.17]|     [41.76]|                [20.0]|       [2673.0]|           [538.0]|      [1282.0]|       [514.0]|         [2.4605]|
|  -124.16|   40.77|              35.0|     2141.0|         438.0|    1053.0|     434.0|       2.8529|     NEAR OCEAN|    [-124.16]|     [40.77]|                [35.0]|       [2141.0]|           [438.0]|      [1053.0]|       [434.0]|         [2.8529]|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+-------------+------------+----------------------+---------------+------------------+--------------+--------------+-----------------+
only showing top 20 rows
out_numeric_cols = ["scaled_"+col for col in numeric_cols]
# Fit one MinMaxScaler per vector column and keep each fitted model in a list.
minmaxScalers = []
for in_col, out_col in zip(vec_numeric_cols, out_numeric_cols):
  minmaxscaler = ml.feature.MinMaxScaler(outputCol=out_col, inputCol=in_col)
  model = minmaxscaler.fit(X_train_minmax_scaled)  # learns the column's min and max
  minmaxScalers.append(model)
  X_train_minmax_scaled = model.transform(X_train_minmax_scaled)
X_train_minmax_scaled.show()
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+-------------+------------+----------------------+---------------+------------------+--------------+--------------+-----------------+--------------------+--------------------+-------------------------+--------------------+---------------------+--------------------+--------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|ocean_proximity|vec_longitude|vec_latitude|vec_housing_median_age|vec_total_rooms|vec_total_bedrooms|vec_population|vec_households|vec_median_income|    scaled_longitude|     scaled_latitude|scaled_housing_median_age|  scaled_total_rooms|scaled_total_bedrooms|   scaled_population|   scaled_households|scaled_median_income|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+-------------+------------+----------------------+---------------+------------------+--------------+--------------+-----------------+--------------------+--------------------+-------------------------+--------------------+---------------------+--------------------+--------------------+--------------------+
|  -124.35|   40.54|              52.0|     1820.0|         300.0|     806.0|     270.0|       3.0147|     NEAR OCEAN|    [-124.35]|     [40.54]|                [52.0]|       [1820.0]|           [300.0]|       [806.0]|       [270.0]|         [3.0147]|               [0.0]|[0.8501594048884162]|                    [1.0]|[0.04623836410804...| [0.04815590272185...|[0.04925775978407...|[0.05021467239126...|[0.17343209059185...|
|   -124.3|    41.8|              19.0|     2672.0|         552.0|    1298.0|     478.0|       1.9797|     NEAR OCEAN|     [-124.3]|      [41.8]|                [19.0]|       [2672.0]|           [552.0]|      [1298.0]|       [478.0]|         [1.9797]|[0.00498007968127...|[0.9840595111583416]|     [0.3529411764705882]|[0.0679078284755074]| [0.08874214849412...|[0.07943810575389...|[0.08904237446331...|[0.10205376477565...|
|  -124.27|   40.69|              36.0|     2349.0|         528.0|    1194.0|     465.0|       2.5179|     NEAR OCEAN|    [-124.27]|     [40.69]|                [36.0]|       [2349.0]|           [528.0]|      [1194.0]|       [465.0]|         [2.5179]|[0.00796812749003...|[0.8660998937300739]|     [0.6862745098039216]|[0.05969276158502...| [0.08487679175390...|[0.07305852042694...|[0.08661564308381...|  [0.13917049420008]|
|  -124.26|   40.58|              52.0|     2217.0|         394.0|     907.0|     369.0|       2.3571|     NEAR OCEAN|    [-124.26]|     [40.58]|                [52.0]|       [2217.0]|           [394.0]|       [907.0]|       [369.0]|         [2.3571]|[0.00896414342629...|[0.8544102019128582]|                    [1.0]|[0.05633552062668...| [0.06329521662103...|[0.05545331861121...|[0.06869516520440...|[0.12808099198631...|
|  -124.25|   40.28|              32.0|     1430.0|         419.0|     434.0|     187.0|       1.9417|     NEAR OCEAN|    [-124.25]|     [40.28]|                [32.0]|       [1430.0]|           [419.0]|       [434.0]|       [187.0]|         [1.9417]|[0.00996015936254...|[0.8225292242295429]|     [0.6078431372549019]|[0.03631924309476...| [0.06732162989209...|[0.02643847380689...|[0.03472092589135...|[0.0994331112674308]|
|  -124.23|   40.81|              52.0|     1112.0|         209.0|     544.0|     172.0|       3.3462|     NEAR OCEAN|    [-124.23]|     [40.81]|                [52.0]|       [1112.0]|           [209.0]|       [544.0]|       [172.0]|         [3.3462]|[0.01195219123505...|[0.8788522848034006]|                    [1.0]|[0.02823134442240...| [0.03349975841520...|[0.03318611213348...|[0.03192085122269...|[0.19629384422283...|
|  -124.21|   40.75|              32.0|     1218.0|         331.0|     620.0|     268.0|       1.6528|     NEAR OCEAN|    [-124.21]|     [40.75]|                [32.0]|       [1218.0]|           [331.0]|       [620.0]|       [268.0]|         [1.6528]|[0.01394422310756...|[0.8724760892667373]|     [0.6078431372549019]|[0.03092731064652...| [0.05314865517796...|[0.03784811679548...|[0.04984132910210...|[0.07950924814830...|
|  -124.21|   41.75|              20.0|     3810.0|         787.0|    1993.0|     721.0|       2.0074|     NEAR OCEAN|    [-124.21]|     [41.75]|                [20.0]|       [3810.0]|           [787.0]|      [1993.0]|       [721.0]|         [2.0074]|[0.01394422310756...|[0.9787460148777893]|     [0.37254901960784...|[0.09685131491937...| [0.12659043324206...|[0.12207091154459...|[0.1344035840955759]|[0.1039640832540241]|
|  -124.21|   41.77|              17.0|     3461.0|         722.0|    1947.0|     647.0|       2.5795|     NEAR OCEAN|    [-124.21]|     [41.77]|                [17.0]|       [3461.0]|           [722.0]|      [1947.0]|       [647.0]|         [2.5795]|[0.01394422310756...|[0.9808714133900106]|     [0.3137254901960784]|[0.0879749732946742]| [0.1161217587373168]|[0.11924917188075...|[0.12058988239686...|[0.14341871146604...|
|  -124.19|   41.78|              15.0|     3140.0|         714.0|    1645.0|     640.0|       1.6654|     NEAR OCEAN|    [-124.19]|     [41.78]|                [15.0]|       [3140.0]|           [714.0]|      [1645.0]|       [640.0]|         [1.6654]|[0.01593625498007...| [0.981934112646121]|     [0.27450980392156...|[0.07981077369143...| [0.1148333064905782]|[0.10072383756594...|[0.1192831808848236]|[0.08037820167997...|
|  -124.18|   40.62|              35.0|      952.0|         178.0|     480.0|     179.0|       3.0536|     NEAR OCEAN|    [-124.18]|     [40.62]|                [35.0]|        [952.0]|           [178.0]|       [480.0]|       [179.0]|         [3.0536]|[0.01693227091633...|[0.8586609989373002]|     [0.6666666666666666]|[0.02416196144259...| [0.02850700595909...|[0.02926021347073...|[0.03322755273473...|[0.1761148122094868]|
|  -124.18|   40.78|              33.0|     1076.0|         222.0|     656.0|     236.0|       2.5096|     NEAR OCEAN|    [-124.18]|     [40.78]|                [33.0]|       [1076.0]|           [222.0]|       [656.0]|       [236.0]|         [2.5096]|[0.01693227091633...| [0.875664187035069]|     [0.6274509803921569]|[0.02731573325194...| [0.03559349331615...|[0.0400564347932769]|[0.04386783647563...|[0.1385980883022303]|
|  -124.18|   40.78|              37.0|     1453.0|         293.0|     867.0|     310.0|       2.5536|     NEAR OCEAN|    [-124.18]|     [40.78]|                [37.0]|       [1453.0]|           [293.0]|       [867.0]|       [310.0]|         [2.5536]|[0.01693227091633...| [0.875664187035069]|     [0.7058823529411764]|[0.03690421689811...| [0.04702850700595...|[0.05299963194700...|[0.05768153817435...|[0.14163252920649...|
|  -124.18|   40.79|              40.0|     1398.0|         311.0|     788.0|     279.0|       1.4668|     NEAR OCEAN|    [-124.18]|     [40.79]|                [40.0]|       [1398.0]|           [311.0]|       [788.0]|       [279.0]|         [1.4668]|[0.01693227091633...|[0.8767268862911792]|     [0.7647058823529411]|[0.03550536649880...| [0.04992752456112...|[0.04815360078517...|[0.05189471719245...| [0.066681838871188]|
|  -124.17|   40.75|              13.0|     2171.0|         339.0|     951.0|     353.0|       4.8516|     NEAR OCEAN|    [-124.17]|     [40.75]|                [13.0]|       [2171.0]|           [339.0]|       [951.0]|       [353.0]|         [4.8516]|[0.01792828685258...|[0.8724760892667373]|     [0.23529411764705...|[0.05516557301999...| [0.05443710742470...|[0.05815237394184...|[0.06570841889117...|[0.30011310188824...|
|  -124.17|   40.76|              26.0|     1776.0|         361.0|     992.0|     380.0|       2.8056|     NEAR OCEAN|    [-124.17]|     [40.76]|                [26.0]|       [1776.0]|           [361.0]|       [992.0]|       [380.0]|         [2.8056]|[0.01792828685258...|[0.8735387885228476]|     [0.49019607843137...|[0.04511928378859...| [0.05798035110323...|[0.06066740277266...|[0.07074855329475...|[0.1590115998400022]|
|  -124.17|   40.77|              30.0|     1895.0|         366.0|     990.0|     359.0|       2.2227|     NEAR OCEAN|    [-124.17]|     [40.77]|                [30.0]|       [1895.0]|           [366.0]|       [990.0]|       [359.0]|         [2.2227]|[0.01792828685258...|[0.8746014877789586]|     [0.5686274509803921]|[0.04814588737982...| [0.05878563375744...|[0.06054471843945...|[0.06682844875863...|[0.11881215431511...|
|  -124.17|    40.8|              52.0|     1557.0|         344.0|     758.0|     319.0|       1.8529|     NEAR OCEAN|    [-124.17]|      [40.8]|                [52.0]|       [1557.0]|           [344.0]|       [758.0]|       [319.0]|         [1.8529]|[0.01792828685258...|[0.8777895855472896]|                    [1.0]|[0.03954931583498...| [0.0552423900789177]|  [0.04631333578702]|[0.05936158297554...|[0.09330905780609...|
|  -124.17|   41.76|              20.0|     2673.0|         538.0|    1282.0|     514.0|       2.4605|     NEAR OCEAN|    [-124.17]|     [41.76]|                [20.0]|       [2673.0]|           [538.0]|      [1282.0]|       [514.0]|         [2.4605]|[0.01792828685258...|[0.9798087141338996]|     [0.37254901960784...|[0.06793326211913...| [0.08648735706232...|[0.07845663108821...|[0.09576255366809...|[0.13521192811133...|
|  -124.16|   40.77|              35.0|     2141.0|         438.0|    1053.0|     434.0|       2.8529|     NEAR OCEAN|    [-124.16]|     [40.77]|                [35.0]|       [2141.0]|           [438.0]|      [1053.0]|       [434.0]|         [2.8529]|[0.01892430278884...|[0.8746014877789586]|     [0.6666666666666666]|[0.05440256371127...| [0.07038170397809...|[0.06440927493559...|[0.08082882210192...|[0.16227362381208...|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+---------------+-------------+------------+----------------------+---------------+------------------+--------------+--------------+-----------------+--------------------+--------------------+-------------------------+--------------------+---------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
# Keep only the scaled columns.
X_train_minmax_scaled = X_train_minmax_scaled.select('scaled_longitude', 'scaled_latitude', 'scaled_housing_median_age',
                                              'scaled_total_rooms', 'scaled_total_bedrooms', 'scaled_population',
                                              'scaled_households', 'scaled_median_income')
X_train_minmax_scaled.show()
+--------------------+--------------------+-------------------------+--------------------+---------------------+--------------------+--------------------+--------------------+
|    scaled_longitude|     scaled_latitude|scaled_housing_median_age|  scaled_total_rooms|scaled_total_bedrooms|   scaled_population|   scaled_households|scaled_median_income|
+--------------------+--------------------+-------------------------+--------------------+---------------------+--------------------+--------------------+--------------------+
|               [0.0]|[0.8501594048884162]|                    [1.0]|[0.04623836410804...| [0.04815590272185...|[0.04925775978407...|[0.05021467239126...|[0.17343209059185...|
|[0.00498007968127...|[0.9840595111583416]|     [0.3529411764705882]|[0.0679078284755074]| [0.08874214849412...|[0.07943810575389...|[0.08904237446331...|[0.10205376477565...|
|[0.00796812749003...|[0.8660998937300739]|     [0.6862745098039216]|[0.05969276158502...| [0.08487679175390...|[0.07305852042694...|[0.08661564308381...|  [0.13917049420008]|
|[0.00896414342629...|[0.8544102019128582]|                    [1.0]|[0.05633552062668...| [0.06329521662103...|[0.05545331861121...|[0.06869516520440...|[0.12808099198631...|
|[0.00996015936254...|[0.8225292242295429]|     [0.6078431372549019]|[0.03631924309476...| [0.06732162989209...|[0.02643847380689...|[0.03472092589135...|[0.0994331112674308]|
|[0.01195219123505...|[0.8788522848034006]|                    [1.0]|[0.02823134442240...| [0.03349975841520...|[0.03318611213348...|[0.03192085122269...|[0.19629384422283...|
|[0.01394422310756...|[0.8724760892667373]|     [0.6078431372549019]|[0.03092731064652...| [0.05314865517796...|[0.03784811679548...|[0.04984132910210...|[0.07950924814830...|
|[0.01394422310756...|[0.9787460148777893]|     [0.37254901960784...|[0.09685131491937...| [0.12659043324206...|[0.12207091154459...|[0.1344035840955759]|[0.1039640832540241]|
|[0.01394422310756...|[0.9808714133900106]|     [0.3137254901960784]|[0.0879749732946742]| [0.1161217587373168]|[0.11924917188075...|[0.12058988239686...|[0.14341871146604...|
|[0.01593625498007...| [0.981934112646121]|     [0.27450980392156...|[0.07981077369143...| [0.1148333064905782]|[0.10072383756594...|[0.1192831808848236]|[0.08037820167997...|
|[0.01693227091633...|[0.8586609989373002]|     [0.6666666666666666]|[0.02416196144259...| [0.02850700595909...|[0.02926021347073...|[0.03322755273473...|[0.1761148122094868]|
|[0.01693227091633...| [0.875664187035069]|     [0.6274509803921569]|[0.02731573325194...| [0.03559349331615...|[0.0400564347932769]|[0.04386783647563...|[0.1385980883022303]|
|[0.01693227091633...| [0.875664187035069]|     [0.7058823529411764]|[0.03690421689811...| [0.04702850700595...|[0.05299963194700...|[0.05768153817435...|[0.14163252920649...|
|[0.01693227091633...|[0.8767268862911792]|     [0.7647058823529411]|[0.03550536649880...| [0.04992752456112...|[0.04815360078517...|[0.05189471719245...| [0.066681838871188]|
|[0.01792828685258...|[0.8724760892667373]|     [0.23529411764705...|[0.05516557301999...| [0.05443710742470...|[0.05815237394184...|[0.06570841889117...|[0.30011310188824...|
|[0.01792828685258...|[0.8735387885228476]|     [0.49019607843137...|[0.04511928378859...| [0.05798035110323...|[0.06066740277266...|[0.07074855329475...|[0.1590115998400022]|
|[0.01792828685258...|[0.8746014877789586]|     [0.5686274509803921]|[0.04814588737982...| [0.05878563375744...|[0.06054471843945...|[0.06682844875863...|[0.11881215431511...|
|[0.01792828685258...|[0.8777895855472896]|                    [1.0]|[0.03954931583498...| [0.0552423900789177]|  [0.04631333578702]|[0.05936158297554...|[0.09330905780609...|
|[0.01792828685258...|[0.9798087141338996]|     [0.37254901960784...|[0.06793326211913...| [0.08648735706232...|[0.07845663108821...|[0.09576255366809...|[0.13521192811133...|
|[0.01892430278884...|[0.8746014877789586]|     [0.6666666666666666]|[0.05440256371127...| [0.07038170397809...|[0.06440927493559...|[0.08082882210192...|[0.16227362381208...|
+--------------------+--------------------+-------------------------+--------------------+---------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
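
The scaled values above are consistent with the usual min-max formula, (x - min) / (max - min), which `MinMaxScaler` applies with its default output range [0, 1]. The following is a minimal plain-Python sketch. The function `min_max_scale` is hypothetical, and the minimum and maximum used for `housing_median_age` (1.0 and 52.0) are assumptions inferred from the table, not values read from the fitted model:

```python
def min_max_scale(x, col_min, col_max, out_min=0.0, out_max=1.0):
    """Rescale x linearly from [col_min, col_max] to [out_min, out_max]."""
    if col_max == col_min:
        # Constant column: Spark's documentation maps it to the midpoint of the output range
        return 0.5 * (out_max + out_min)
    return (x - col_min) / (col_max - col_min) * (out_max - out_min) + out_min

# First three housing_median_age values from the table above
ages = [52.0, 19.0, 36.0]
scaled = [min_max_scale(a, 1.0, 52.0) for a in ages]
print([round(s, 4) for s in scaled])  # [1.0, 0.3529, 0.6863]
```

These match the first three `scaled_housing_median_age` entries shown above (1.0, 0.3529..., 0.6862...).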

Creating the Pipeline (Sequence of Operations)

# 1. Fill missing total_bedrooms values with the column median.
imputer = ml.feature.Imputer(strategy="median",inputCols=["total_bedrooms"],outputCols=["total_bedrooms_complete"])

# 2. Map each ocean_proximity category to a numeric index (most frequent category gets index 0).
stringIndexer = ml.feature.StringIndexer(inputCol="ocean_proximity", outputCol="ordinal_ocean_proximity", stringOrderType="frequencyDesc")

# 3. One-hot encode the category index.
onehotencoder = ml.feature.OneHotEncoder(inputCol="ordinal_ocean_proximity", outputCol="onehot_ocean_proximity")

columns_to_scale = ["longitude","latitude","housing_median_age","total_rooms","total_bedrooms_complete","population","households","median_income"]

# 4. Wrap each numeric column (and the target) in a one-element vector.
assemblers = [ml.feature.VectorAssembler(inputCols=[col], outputCol=col + "_vec") for col in columns_to_scale+["median_house_value"]]

# 5. Min-max scale each numeric feature; the target is not scaled.
scalers = [ml.feature.MinMaxScaler(inputCol=col + "_vec", outputCol="scaled_" + col) for col in columns_to_scale]

# 6. Concatenate the scaled features and the one-hot vector into a single feature vector.
feature_assembler = ml.feature.VectorAssembler(inputCols=["scaled_" + col for col in columns_to_scale]+["onehot_ocean_proximity"], outputCol="features")

# 7. Keep only the features and the target; note that label is the one-element vector median_house_value_vec.
sqlTrans = ml.feature.SQLTransformer(statement="SELECT features, median_house_value_vec AS label FROM __THIS__")

# Chain all stages, fit them on the training data, and show the transformed result.
preprocess_pipeline = ml.Pipeline(stages=[imputer, stringIndexer, onehotencoder]+assemblers+scalers+[feature_assembler, sqlTrans])
pipeline_model = preprocess_pipeline.fit(housing_data_train)
pipeline_model.transform(housing_data_train).show()
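
Conceptually, `Pipeline.fit` fits each estimator stage on the output of the stages before it, and the resulting model replays the fitted stages in order on any new data. The sketch below imitates that behavior in plain Python on a list of dictionaries. The classes `MedianImputer` and `MinMax` and the functions `fit_pipeline` and `apply_pipeline` are hypothetical stand-ins for illustration only; they are not Spark APIs, and the exact median here differs from the approximate median Spark's `Imputer` computes:

```python
from statistics import median

class MedianImputer:
    """Toy stage: fills missing values with the median learned in fit()."""
    def __init__(self, key, out_key):
        self.key, self.out_key = key, out_key

    def fit(self, rows):
        self.fill = median(r[self.key] for r in rows if r[self.key] is not None)
        return self

    def transform(self, rows):
        return [{**r, self.out_key: self.fill if r[self.key] is None else r[self.key]}
                for r in rows]

class MinMax:
    """Toy stage: rescales a column to [0, 1] using the min/max learned in fit()."""
    def __init__(self, key):
        self.key = key

    def fit(self, rows):
        values = [r[self.key] for r in rows]
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, rows):
        span = self.hi - self.lo
        return [{**r, "scaled_" + self.key: (r[self.key] - self.lo) / span} for r in rows]

def fit_pipeline(stages, rows):
    """Fit each stage on the output of the previous ones; return the fitted stages."""
    fitted = []
    for stage in stages:
        stage.fit(rows)
        rows = stage.transform(rows)
        fitted.append(stage)
    return fitted

def apply_pipeline(fitted, rows):
    """Replay the fitted stages, in order, on new rows."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

train = [{"total_bedrooms": 300.0}, {"total_bedrooms": None}, {"total_bedrooms": 552.0}]
stages = [MedianImputer("total_bedrooms", "total_bedrooms_complete"),
          MinMax("total_bedrooms_complete")]
fitted = fit_pipeline(stages, train)

# A new row with a missing value is filled and scaled with the training statistics
new_rows = apply_pipeline(fitted, [{"total_bedrooms": None}])
print(new_rows[0]["scaled_total_bedrooms_complete"])  # 0.5
```

The point of the design is that the statistics (median, minimum, maximum) are learned once from the training data and then reused unchanged on any other data set.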