Integración de datos con Apache Spark#

De manera general, la integración de datos se refiere a realizar consultas para las cuales los datos están divididos en distintas tablas, o incluso almacenados en distintas máquinas. En la vida real, se presentan casos complejos en los que los esquemas de las tablas no son exactamente iguales, por lo que hay que hacer un trabajo previo de correspondencia entre los atributos. En este notebook trabajaremos con un ejemplo simple de integración de datos usando el dataset MovieLens

Configuración del ambiente en Google Colaboratory#

# Descargar Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# A continuación, instalaremos Apache Spark 3.0.1 con Hadoop 2.7 desde aquí.
!wget https://dlcdn.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
# Ahora, sólo tenemos que descomprimir esa carpeta.
!tar xf spark-3.5.6-bin-hadoop3.tgz
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.6-bin-hadoop3"
# Instalación de los paquetes necesarios
!pip install pyspark
!pip install findspark
--2025-10-07 13:31:27--  https://dlcdn.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400923510 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.6-bin-hadoop3.tgz’

spark-3.5.6-bin-had 100%[===================>] 382.35M   271MB/s    in 1.4s    

2025-10-07 13:31:28 (271 MB/s) - ‘spark-3.5.6-bin-hadoop3.tgz’ saved [400923510/400923510]

Requirement already satisfied: pyspark in /usr/local/lib/python3.12/dist-packages (3.5.1)
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.12/dist-packages (from pyspark) (0.10.9.7)
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
# Importamos las librerias necesarias
import numpy as np
import datetime as dt
#import matplotlib.pyplot as plt

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql import functions as fct
# Descargamos el dataset
!wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip
%cd ml-latest-small
Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  
/content/ml-latest-small
!echo "Lista de archivos en el dataset: "
!ls
Lista de archivos en el dataset: 
links.csv  movies.csv  ratings.csv  README.txt	tags.csv
!cat README.txt
Summary
=======

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


Usage License
=============

Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:

* The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in publications resulting from the use of the data set (see below for citation information).
* The user may redistribute the data set, including transformations, so long as it is distributed under these same license conditions.
* The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
* The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.

In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).

If you have any further questions or comments, please email <grouplens-info@umn.edu>


Citation
========

To acknowledge use of the dataset in publications, please cite the following paper:

> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <https://doi.org/10.1145/2827872>


Further Information About GroupLens
===================================

GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:

* recommender systems
* online communities
* mobile and ubiquitious technologies
* digital libraries
* local geographic information systems

GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit <http://movielens.org> to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at <grouplens-info@cs.umn.edu> - we are always interested in working with external collaborators.


Content and Use of Files
========================

Formatting and Encoding
-----------------------

The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.


User Ids
--------

MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).


Movie Ids
---------

Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).


Ratings Data File Structure (ratings.csv)
-----------------------------------------

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


Tags Data File Structure (tags.csv)
-----------------------------------

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


Movies Data File Structure (movies.csv)
---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres

Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)


Links Data File Structure (links.csv)
---------------------------------------

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId

movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.

imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.

tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.

Use of the resources listed above is subject to the terms of each provider.


Cross-Validation
----------------

Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see [LensKit](http://lenskit.org) for tools, documentation, and open-source code examples.
# Iniciamos la sesión de Spark
ss = (SparkSession
      .builder
      .appName("data_integration")
      .getOrCreate())

Lectura de los datos#

# Información sobre las películas
!head movies.csv
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
movies_data = ss.read.csv('/content/ml-latest-small/movies.csv', sep=',',
                          header=True, quote='"',
                          schema='movieId INT, title STRING, genres STRING')
movies_data.printSchema()
movies_data.head(5)
root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)
[Row(movieId=1, title='Toy Story (1995)', genres='Adventure|Animation|Children|Comedy|Fantasy'),
 Row(movieId=2, title='Jumanji (1995)', genres='Adventure|Children|Fantasy'),
 Row(movieId=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(movieId=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama|Romance'),
 Row(movieId=5, title='Father of the Bride Part II (1995)', genres='Comedy')]
# Información sobre las valoraciones de los usuarios
!head ratings.csv
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
1,70,3.0,964982400
1,101,5.0,964980868
1,110,4.0,964982176
1,151,5.0,964984041
ratings_data = ss.read.csv('/content/ml-latest-small/ratings.csv', sep=',',
                           header=True, quote='"',
                           schema='userId INT, movieId INT, rating DOUBLE, timestamp INT')
ratings_data.printSchema()
ratings_data.head(5)
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)
[Row(userId=1, movieId=1, rating=4.0, timestamp=964982703),
 Row(userId=1, movieId=3, rating=4.0, timestamp=964981247),
 Row(userId=1, movieId=6, rating=4.0, timestamp=964982224),
 Row(userId=1, movieId=47, rating=5.0, timestamp=964983815),
 Row(userId=1, movieId=50, rating=5.0, timestamp=964982931)]
# Información sobre los tags creados por el usuario
!head tags.csv
userId,movieId,tag,timestamp
2,60756,funny,1445714994
2,60756,Highly quotable,1445714996
2,60756,will ferrell,1445714992
2,89774,Boxing story,1445715207
2,89774,MMA,1445715200
2,89774,Tom Hardy,1445715205
2,106782,drugs,1445715054
2,106782,Leonardo DiCaprio,1445715051
2,106782,Martin Scorsese,1445715056
tags_data = ss.read.csv('/content/ml-latest-small/tags.csv', sep=',',
                        header=True, quote='"',schema='userId INT, movieId INT, tag STRING, timestamp INT')
tags_data.printSchema()
tags_data.head(5)
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- tag: string (nullable = true)
 |-- timestamp: integer (nullable = true)
[Row(userId=2, movieId=60756, tag='funny', timestamp=1445714994),
 Row(userId=2, movieId=60756, tag='Highly quotable', timestamp=1445714996),
 Row(userId=2, movieId=60756, tag='will ferrell', timestamp=1445714992),
 Row(userId=2, movieId=89774, tag='Boxing story', timestamp=1445715207),
 Row(userId=2, movieId=89774, tag='MMA', timestamp=1445715200)]

Conversión de las etiquetas de tiempo#

# Obtener el año a partir de la etiqueta de tiempo (timestamp)
get_year = fct.udf(lambda x: dt.datetime.fromtimestamp(x/1000.0).year)
# Almacenar el año en una nueva columna del DataFrame
ratings_data = ratings_data.withColumn("year", get_year(ratings_data.timestamp))
tags_data = tags_data.withColumn("year", get_year(tags_data.timestamp))
# Primeras filas de los DataFrame de valoraciones y etiquetas
ratings_data.show(5)
+------+-------+------+---------+----+
|userId|movieId|rating|timestamp|year|
+------+-------+------+---------+----+
|     1|      1|   4.0|964982703|1970|
|     1|      3|   4.0|964981247|1970|
|     1|      6|   4.0|964982224|1970|
|     1|     47|   5.0|964983815|1970|
|     1|     50|   5.0|964982931|1970|
+------+-------+------+---------+----+
only showing top 5 rows

Información básica de los datos#

movies_data.show(6)
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
+-------+--------------------+--------------------+
only showing top 6 rows
res = (movies_data
       .select('movieId')
       .dropDuplicates()
       .count())
print(f'{res} películas')
9742 películas
ratings_data.show(6)
+------+-------+------+---------+----+
|userId|movieId|rating|timestamp|year|
+------+-------+------+---------+----+
|     1|      1|   4.0|964982703|1970|
|     1|      3|   4.0|964981247|1970|
|     1|      6|   4.0|964982224|1970|
|     1|     47|   5.0|964983815|1970|
|     1|     50|   5.0|964982931|1970|
|     1|     70|   3.0|964982400|1970|
+------+-------+------+---------+----+
only showing top 6 rows
res = (ratings_data
       .count())
print(f'{res} valoraciones')
100836 valoraciones
res = (ratings_data
       .select('userId')
       .dropDuplicates()
       .count())
print(f'{res} usuarios con valoraciones para al menos una película')
610 usuarios con valoraciones para al menos una película
print('Distribución de las valoraciones:')
(ratings_data
 .groupby('rating')
 .count()
 .sort(fct.desc('rating'))
 .show())
Distribución de las valoraciones:
+------+-----+
|rating|count|
+------+-----+
|   5.0|13211|
|   4.5| 8551|
|   4.0|26818|
|   3.5|13136|
|   3.0|20047|
|   2.5| 5550|
|   2.0| 7551|
|   1.5| 1791|
|   1.0| 2811|
|   0.5| 1370|
+------+-----+
print('Datos originales: ')
(movies_data
 .show(1))
print('Modificación sobre la columna género: ')
movies_data_genre = (movies_data
                     .withColumn("genres_array", fct.split("genres", "\|")) # Generar una lista con los diferentes géneros ["niños","animacion"]
                     .withColumn("genre", fct.explode("genres_array")) # Generar un "nuevo" registro por cada género de una película
                     .select("movieId", "title", "genre"))
movies_data_genre.show(5)
print('Géneros únicos de películas: ')
(movies_data
 .withColumn("genres_array", fct.split("genres", "\|"))
 .withColumn("genre", fct.explode("genres_array"))
 .select("genre")
 .dropDuplicates()
 .sort(fct.asc("genre"))
 .show())
<>:6: SyntaxWarning: invalid escape sequence '\|'
<>:12: SyntaxWarning: invalid escape sequence '\|'
<>:6: SyntaxWarning: invalid escape sequence '\|'
<>:12: SyntaxWarning: invalid escape sequence '\|'
/tmp/ipython-input-235001009.py:6: SyntaxWarning: invalid escape sequence '\|'
  .withColumn("genres_array", fct.split("genres", "\|")) # Generar una lista con los diferentes géneros ["niños","animacion"]
/tmp/ipython-input-235001009.py:12: SyntaxWarning: invalid escape sequence '\|'
  .withColumn("genres_array", fct.split("genres", "\|"))
Datos originales: 
+-------+----------------+--------------------+
|movieId|           title|              genres|
+-------+----------------+--------------------+
|      1|Toy Story (1995)|Adventure|Animati...|
+-------+----------------+--------------------+
only showing top 1 row

Modificación sobre la columna género: 
+-------+----------------+---------+
|movieId|           title|    genre|
+-------+----------------+---------+
|      1|Toy Story (1995)|Adventure|
|      1|Toy Story (1995)|Animation|
|      1|Toy Story (1995)| Children|
|      1|Toy Story (1995)|   Comedy|
|      1|Toy Story (1995)|  Fantasy|
+-------+----------------+---------+
only showing top 5 rows

Géneros únicos de películas: 
+------------------+
|             genre|
+------------------+
|(no genres listed)|
|            Action|
|         Adventure|
|         Animation|
|          Children|
|            Comedy|
|             Crime|
|       Documentary|
|             Drama|
|           Fantasy|
|         Film-Noir|
|            Horror|
|              IMAX|
|           Musical|
|           Mystery|
|           Romance|
|            Sci-Fi|
|          Thriller|
|               War|
|           Western|
+------------------+

Combinaciones [ref]#

Combinaciones internas (Inner Joins)#

Una combinación interna es una en la que solo se incluyen datos de una tabla si hay datos correspondientes en la tabla relacionada y viceversa.

movies_data_inner = movies_data.join(ratings_data, ["movieId"], "inner") #La tabla de las peliculas(movies_data) con la tabla de valoraciones (rating_data) se combina el movieID con su valoracion Inner
print(f'{movies_data_inner.count()} filas con una unión interna')
movies_data_inner.show()
100836 filas con una unión interna
+-------+--------------------+--------------------+------+------+---------+----+
|movieId|               title|              genres|userId|rating|timestamp|year|
+-------+--------------------+--------------------+------+------+---------+----+
|      1|    Toy Story (1995)|Adventure|Animati...|     1|   4.0|964982703|1970|
|      3|Grumpier Old Men ...|      Comedy|Romance|     1|   4.0|964981247|1970|
|      6|         Heat (1995)|Action|Crime|Thri...|     1|   4.0|964982224|1970|
|     47|Seven (a.k.a. Se7...|    Mystery|Thriller|     1|   5.0|964983815|1970|
|     50|Usual Suspects, T...|Crime|Mystery|Thr...|     1|   5.0|964982931|1970|
|     70|From Dusk Till Da...|Action|Comedy|Hor...|     1|   3.0|964982400|1970|
|    101|Bottle Rocket (1996)|Adventure|Comedy|...|     1|   5.0|964980868|1970|
|    110|   Braveheart (1995)|    Action|Drama|War|     1|   4.0|964982176|1970|
|    151|      Rob Roy (1995)|Action|Drama|Roma...|     1|   5.0|964984041|1970|
|    157|Canadian Bacon (1...|          Comedy|War|     1|   5.0|964984100|1970|
|    163|    Desperado (1995)|Action|Romance|We...|     1|   5.0|964983650|1970|
|    216|Billy Madison (1995)|              Comedy|     1|   5.0|964981208|1970|
|    223|       Clerks (1994)|              Comedy|     1|   3.0|964980985|1970|
|    231|Dumb & Dumber (Du...|    Adventure|Comedy|     1|   5.0|964981179|1970|
|    235|      Ed Wood (1994)|        Comedy|Drama|     1|   4.0|964980908|1970|
|    260|Star Wars: Episod...|Action|Adventure|...|     1|   5.0|964981680|1970|
|    296| Pulp Fiction (1994)|Comedy|Crime|Dram...|     1|   3.0|964982967|1970|
|    316|     Stargate (1994)|Action|Adventure|...|     1|   3.0|964982310|1970|
|    333|    Tommy Boy (1995)|              Comedy|     1|   5.0|964981179|1970|
|    349|Clear and Present...|Action|Crime|Dram...|     1|   4.0|964982563|1970|
+-------+--------------------+--------------------+------+------+---------+----+
only showing top 20 rows

Combinaciones externas (Outer Joins)#

Una combinación externa es como una combinación interna, pero agrega las filas restantes de una de las tablas. Las combinaciones externas son direccionales: una combinación externa izquierda incluye todos los registros de la tabla izquierda (la primera tabla de la combinación) y una combinación externa derecha incluye todos los registros de la tabla derecha, la segunda tabla de la combinación.

# LEFT OUTER
movies_data_outer_l = movies_data.join(ratings_data, ["movieId"], "left")
print(f'{movies_data_outer_l.count()} filas con una unión externa (izq)')
movies_data_outer_l.show()
100854 filas con una unión externa (izq)
+-------+----------------+--------------------+------+------+----------+----+
|movieId|           title|              genres|userId|rating| timestamp|year|
+-------+----------------+--------------------+------+------+----------+----+
|      1|Toy Story (1995)|Adventure|Animati...|   610|   5.0|1479542900|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   609|   3.0| 847221025|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   608|   2.5|1117408267|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   607|   4.0| 964744033|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   606|   2.5|1349082950|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   605|   4.0|1277097561|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   604|   3.0| 832079851|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   603|   4.0| 963178147|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   601|   4.0|1521467801|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   600|   2.5|1237764347|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   599|   3.0|1498524204|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   597|   4.0| 941557863|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   596|   4.0|1535709666|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   590|   4.0|1258420408|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   587|   5.0| 953137847|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   584|   5.0| 834987643|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   580|   3.0|1167792349|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   579|   4.0| 958881146|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   573|   5.0|1186722182|1970|
|      1|Toy Story (1995)|Adventure|Animati...|   572|   4.0| 945892484|1970|
+-------+----------------+--------------------+------+------+----------+----+
only showing top 20 rows
# RIGHT OUTER
movies_data_outer_r = movies_data.join(ratings_data, ["movieId"], "right")
print(f'{movies_data_outer_r.count()} filas con una unión externa (der)')
movies_data_outer_r.show()
100836 filas con una unión externa (der)
+-------+--------------------+--------------------+------+------+---------+----+
|movieId|               title|              genres|userId|rating|timestamp|year|
+-------+--------------------+--------------------+------+------+---------+----+
|      1|    Toy Story (1995)|Adventure|Animati...|     1|   4.0|964982703|1970|
|      3|Grumpier Old Men ...|      Comedy|Romance|     1|   4.0|964981247|1970|
|      6|         Heat (1995)|Action|Crime|Thri...|     1|   4.0|964982224|1970|
|     47|Seven (a.k.a. Se7...|    Mystery|Thriller|     1|   5.0|964983815|1970|
|     50|Usual Suspects, T...|Crime|Mystery|Thr...|     1|   5.0|964982931|1970|
|     70|From Dusk Till Da...|Action|Comedy|Hor...|     1|   3.0|964982400|1970|
|    101|Bottle Rocket (1996)|Adventure|Comedy|...|     1|   5.0|964980868|1970|
|    110|   Braveheart (1995)|    Action|Drama|War|     1|   4.0|964982176|1970|
|    151|      Rob Roy (1995)|Action|Drama|Roma...|     1|   5.0|964984041|1970|
|    157|Canadian Bacon (1...|          Comedy|War|     1|   5.0|964984100|1970|
|    163|    Desperado (1995)|Action|Romance|We...|     1|   5.0|964983650|1970|
|    216|Billy Madison (1995)|              Comedy|     1|   5.0|964981208|1970|
|    223|       Clerks (1994)|              Comedy|     1|   3.0|964980985|1970|
|    231|Dumb & Dumber (Du...|    Adventure|Comedy|     1|   5.0|964981179|1970|
|    235|      Ed Wood (1994)|        Comedy|Drama|     1|   4.0|964980908|1970|
|    260|Star Wars: Episod...|Action|Adventure|...|     1|   5.0|964981680|1970|
|    296| Pulp Fiction (1994)|Comedy|Crime|Dram...|     1|   3.0|964982967|1970|
|    316|     Stargate (1994)|Action|Adventure|...|     1|   3.0|964982310|1970|
|    333|    Tommy Boy (1995)|              Comedy|     1|   5.0|964981179|1970|
|    349|Clear and Present...|Action|Crime|Dram...|     1|   4.0|964982563|1970|
+-------+--------------------+--------------------+------+------+---------+----+
only showing top 20 rows

Combinación externa completa (Full Join)#

En algunos sistemas, una combinación externa completa incluye todas las filas de ambas tablas, con filas combinadas cuando se correspondan.

movies_data_full = movies_data.join(ratings_data, ["movieId"], "full")
print(f'{movies_data_full.count()} filas con una unión completa')
movies_data_full.show()
100854 filas con una unión completa
+-------+----------------+--------------------+------+------+----------+----+
|movieId|           title|              genres|userId|rating| timestamp|year|
+-------+----------------+--------------------+------+------+----------+----+
|      1|Toy Story (1995)|Adventure|Animati...|     1|   4.0| 964982703|1970|
|      1|Toy Story (1995)|Adventure|Animati...|     5|   4.0| 847434962|1970|
|      1|Toy Story (1995)|Adventure|Animati...|     7|   4.5|1106635946|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    15|   2.5|1510577970|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    17|   4.5|1305696483|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    18|   3.5|1455209816|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    19|   4.0| 965705637|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    21|   3.5|1407618878|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    27|   3.0| 962685262|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    31|   5.0| 850466616|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    32|   3.0| 856736119|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    33|   3.0| 939647444|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    40|   5.0| 832058959|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    43|   5.0| 848993983|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    44|   3.0| 869251860|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    45|   4.0| 951170182|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    46|   5.0| 834787906|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    50|   3.0|1514238116|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    54|   3.0| 830247330|1970|
|      1|Toy Story (1995)|Adventure|Animati...|    57|   5.0| 965796031|1970|
+-------+----------------+--------------------+------+------+----------+----+
only showing top 20 rows

Extracción de información#

Valoraciones promedio por película#

rating_by_movie =(ratings_data
                  .groupby("movieId")
                  .agg(fct.count("rating").alias("# reviews"), fct.avg("rating").alias("avg rating")) #agg permite aplicar varias funciones de agregacion, en este caso ademas del conteo se genera el promedio de valoraciones
                  .join(movies_data, ["movieID"], "inner")
                  .select(["title", "avg rating", "# reviews"]))
rating_by_movie.orderBy(["# reviews"], ascending=False).show()
+--------------------+------------------+---------+
|               title|        avg rating|# reviews|
+--------------------+------------------+---------+
| Forrest Gump (1994)| 4.164133738601824|      329|
|Shawshank Redempt...| 4.429022082018927|      317|
| Pulp Fiction (1994)| 4.197068403908795|      307|
|Silence of the La...| 4.161290322580645|      279|
|  Matrix, The (1999)| 4.192446043165468|      278|
|Star Wars: Episod...| 4.231075697211155|      251|
|Jurassic Park (1993)|              3.75|      238|
|   Braveheart (1995)| 4.031645569620253|      237|
|Terminator 2: Jud...| 3.970982142857143|      224|
|Schindler's List ...|             4.225|      220|
|   Fight Club (1999)| 4.272935779816514|      218|
|    Toy Story (1995)|3.9209302325581397|      215|
|Star Wars: Episod...|4.2156398104265405|      211|
|Usual Suspects, T...| 4.237745098039215|      204|
|American Beauty (...| 4.056372549019608|      204|
|Seven (a.k.a. Se7...|3.9753694581280787|      203|
|Independence Day ...|3.4455445544554455|      202|
|    Apollo 13 (1995)| 3.845771144278607|      201|
|Raiders of the Lo...|            4.2075|      200|
|Lord of the Rings...| 4.106060606060606|      198|
+--------------------+------------------+---------+
only showing top 20 rows
(rating_by_movie
 .orderBy(["avg rating"], ascending=False)
 .show())
+--------------------+----------+---------+
|               title|avg rating|# reviews|
+--------------------+----------+---------+
|Passenger, The (P...|       5.0|        1|
|Five Senses, The ...|       5.0|        1|
|Tales of Manhatta...|       5.0|        1|
|Tenchi Muyô! In L...|       5.0|        1|
|Denise Calls Up (...|       5.0|        1|
|  Palindromes (2004)|       5.0|        1|
|Strictly Sexual (...|       5.0|        1|
|Supercop 2 (Proje...|       5.0|        1|
|What Happened Was...|       5.0|        1|
|          61* (2001)|       5.0|        1|
|One I Love, The (...|       5.0|        1|
|Valet, The (La do...|       5.0|        1|
|Who Killed Chea V...|       5.0|        1|
|Scooby-Doo! Abrac...|       5.0|        1|
|     Lamerica (1994)|       5.0|        2|
|  Indignation (2016)|       5.0|        1|
|Awfully Big Adven...|       5.0|        1|
|Raise Your Voice ...|       5.0|        1|
|Adventures Of She...|       5.0|        1|
|Zeitgeist: Moving...|       5.0|        1|
+--------------------+----------+---------+
only showing top 20 rows
(rating_by_movie
 .where(rating_by_movie["# reviews"]>10) #promedio de peliculas con mas de 10 valoraciones
 .orderBy(["avg rating"])
 .show())
+--------------------+------------------+---------+
|               title|        avg rating|# reviews|
+--------------------+------------------+---------+
|Problem Child (1990)|1.5833333333333333|       12|
|Speed 2: Cruise C...| 1.605263157894737|       19|
|Flintstones in Vi...|             1.625|       12|
|Battlefield Earth...|1.6578947368421053|       19|
|Superman IV: The ...|            1.6875|       16|
|Karate Kid, Part ...|              1.75|       14|
|    Beethoven (1992)|1.7727272727272727|       11|
|Stop! Or My Mom W...|1.7727272727272727|       11|
|Dungeons & Dragon...|1.8333333333333333|       12|
|Sister Act 2: Bac...|1.8928571428571428|       14|
|    Rambo III (1988)|1.9166666666666667|       12|
|  Ultraviolet (2006)|1.9230769230769231|       13|
|     Anaconda (1997)|1.9259259259259258|       27|
|      Rocky V (1990)|1.9411764705882353|       17|
|Dumb and Dumberer...|1.9545454545454546|       11|
|     Godzilla (1998)|1.9545454545454546|       33|
|Super Mario Bros....|               2.0|       21|
|Honey, I Blew Up ...|              2.05|       20|
|I Still Know What...|2.0555555555555554|       18|
|Teenage Mutant Ni...|2.0714285714285716|       14|
+--------------------+------------------+---------+
only showing top 20 rows
rating_by_movie.show()
+--------------------+------------------+---------+
|               title|        avg rating|# reviews|
+--------------------+------------------+---------+
|Men in Black (a.k...| 3.487878787878788|      165|
|    King Kong (1933)|              3.64|       25|
| Galaxy Quest (1999)|              3.58|       75|
|Dirty Dancing (1987)| 3.369047619047619|       42|
|Knockin' on Heave...|              4.25|        4|
|Ice Age 2: The Me...| 3.217391304347826|       23|
|Searching for Sug...|              4.25|        4|
|   Local Hero (1983)| 4.055555555555555|        9|
|     Candyman (1992)|               2.5|       11|
|        Spawn (1997)|2.6346153846153846|       26|
|The Devil's Advoc...| 3.411764705882353|       51|
|Land Before Time,...|3.3333333333333335|        9|
|American Tail: Fi...|               2.7|       10|
|Hudsucker Proxy, ...|              3.55|       40|
|Dungeons & Dragon...|1.8333333333333333|       12|
|High School High ...|               2.0|        6|
|Hellbound: Hellra...|3.2777777777777777|        9|
|Tale of Two Siste...|              3.25|        4|
|Out of Africa (1985)|3.6666666666666665|       15|
|     17 Again (2009)|              3.55|       10|
+--------------------+------------------+---------+
only showing top 20 rows
movies_data_genre.show()
+-------+--------------------+---------+
|movieId|               title|    genre|
+-------+--------------------+---------+
|      1|    Toy Story (1995)|Adventure|
|      1|    Toy Story (1995)|Animation|
|      1|    Toy Story (1995)| Children|
|      1|    Toy Story (1995)|   Comedy|
|      1|    Toy Story (1995)|  Fantasy|
|      2|      Jumanji (1995)|Adventure|
|      2|      Jumanji (1995)| Children|
|      2|      Jumanji (1995)|  Fantasy|
|      3|Grumpier Old Men ...|   Comedy|
|      3|Grumpier Old Men ...|  Romance|
|      4|Waiting to Exhale...|   Comedy|
|      4|Waiting to Exhale...|    Drama|
|      4|Waiting to Exhale...|  Romance|
|      5|Father of the Bri...|   Comedy|
|      6|         Heat (1995)|   Action|
|      6|         Heat (1995)|    Crime|
|      6|         Heat (1995)| Thriller|
|      7|      Sabrina (1995)|   Comedy|
|      7|      Sabrina (1995)|  Romance|
|      8| Tom and Huck (1995)|Adventure|
+-------+--------------------+---------+
only showing top 20 rows

Valoraciones promedio por género#

(rating_by_movie
 .join(movies_data_genre, ["title"], "inner")
 .groupby("genre")
 .agg(fct.count("genre").alias("# movies"), fct.avg("avg rating").alias("avg rating"), fct.sum("# reviews").alias("# reviews"))
 .orderBy(["avg rating"], ascending=False)
 .show())
+------------------+--------+------------------+---------+
|             genre|# movies|        avg rating|# reviews|
+------------------+--------+------------------+---------+
|       Documentary|     438|3.7816816901269976|     1219|
|         Film-Noir|      85| 3.670470937518959|      870|
|               War|     381|3.5716549384975473|     4859|
|         Animation|     610| 3.497119150128773|     6988|
|             Drama|    4354|3.4212488182273826|    41947|
|           Western|     167|3.3833889087601254|     1930|
|(no genres listed)|      34|3.3711484593837535|       47|
|           Romance|    1595|3.3655331380531415|    18171|
|           Mystery|     573|3.3303636065388305|     7674|
|              IMAX|     158| 3.311996371092887|     4145|
|             Crime|    1198|3.3026754773900597|    16697|
|           Musical|     333| 3.296370961846862|     4138|
|           Fantasy|     778| 3.218843125330384|    11834|
|         Adventure|    1264|3.2144936850836103|    24164|
|            Comedy|    3756|3.1821302564909026|    39070|
|          Thriller|    1894|3.1564167433398382|    26475|
|          Children|     664|3.1076903605293116|     9208|
|            Sci-Fi|     984|3.1021186023502407|    17300|
|            Action|    1830|3.0944771394149706|    30687|
|            Horror|     977|2.9189646997100787|     7291|
+------------------+--------+------------------+---------+
ratings_data.show()
+------+-------+------+---------+----+
|userId|movieId|rating|timestamp|year|
+------+-------+------+---------+----+
|     1|      1|   4.0|964982703|1970|
|     1|      3|   4.0|964981247|1970|
|     1|      6|   4.0|964982224|1970|
|     1|     47|   5.0|964983815|1970|
|     1|     50|   5.0|964982931|1970|
|     1|     70|   3.0|964982400|1970|
|     1|    101|   5.0|964980868|1970|
|     1|    110|   4.0|964982176|1970|
|     1|    151|   5.0|964984041|1970|
|     1|    157|   5.0|964984100|1970|
|     1|    163|   5.0|964983650|1970|
|     1|    216|   5.0|964981208|1970|
|     1|    223|   3.0|964980985|1970|
|     1|    231|   5.0|964981179|1970|
|     1|    235|   4.0|964980908|1970|
|     1|    260|   5.0|964981680|1970|
|     1|    296|   3.0|964982967|1970|
|     1|    316|   3.0|964982310|1970|
|     1|    333|   5.0|964981179|1970|
|     1|    349|   4.0|964982563|1970|
+------+-------+------+---------+----+
only showing top 20 rows

Valoraciones promedio por usuario#

rating_by_user =(ratings_data
                  .groupby("userId")
                  .agg(fct.count("rating").alias("# reviews"), fct.avg("rating").alias("avg rating"))
                  .select(["userId", "avg rating", "# reviews"]))
rating_by_user.orderBy(["# reviews"], ascending=False).show()
+------+------------------+---------+
|userId|        avg rating|# reviews|
+------+------------------+---------+
|   414| 3.391957005189029|     2698|
|   599|2.6420500403551253|     2478|
|   474| 3.398956356736243|     2108|
|   448|2.8473712446351933|     1864|
|   274| 3.235884101040119|     1346|
|   610|3.6885560675883258|     1302|
|    68| 3.233730158730159|     1260|
|   380|3.6732348111658455|     1218|
|   606|3.6573991031390136|     1115|
|   288|3.1459715639810426|     1055|
|   249|3.6964627151051626|     1046|
|   387|3.2585199610516065|     1027|
|   182|3.5112589559877176|      977|
|   307|2.6656410256410257|      975|
|   603|3.5079533404029695|      943|
|   298| 2.363684771033014|      939|
|   177| 3.375553097345133|      904|
|   318| 3.755972696245734|      879|
|   232|3.2505800464037122|      862|
|   480|  3.27511961722488|      836|
+------+------------------+---------+
only showing top 20 rows
(rating_by_user
 .where(rating_by_user["# reviews"]>10)
 .orderBy(["avg rating"], ascending=False)
 .show())
+------+------------------+---------+
|userId|        avg rating|# reviews|
+------+------------------+---------+
|    53|               5.0|       20|
|   251| 4.869565217391305|       23|
|   515| 4.846153846153846|       26|
|    25|4.8076923076923075|       26|
|    30| 4.735294117647059|       34|
|   523| 4.693333333333333|       75|
|   348| 4.672727272727273|       55|
|   171| 4.634146341463414|       82|
|   452| 4.556930693069307|      202|
|    43| 4.552631578947368|      114|
|   371| 4.548780487804878|       41|
|   122| 4.546232876712328|      292|
|   441| 4.522222222222222|       45|
|   400| 4.511627906976744|       43|
|    52| 4.476923076923077|      130|
|   538| 4.472972972972973|       37|
|   168| 4.462765957446808|       94|
|   417| 4.462686567164179|       67|
|   543| 4.453947368421052|       76|
|   106|4.4393939393939394|       33|
+------+------------------+---------+
only showing top 20 rows
(rating_by_user
 .where(rating_by_user["# reviews"]>10)
 .orderBy(["avg rating"])
 .show())
+------+------------------+---------+
|userId|        avg rating|# reviews|
+------+------------------+---------+
|   442|             1.275|       20|
|   139|2.1443298969072164|      194|
|   508|2.1458333333333335|       24|
|   153| 2.217877094972067|      179|
|   567|2.2454545454545456|      385|
|   311|2.3392857142857144|       28|
|   298| 2.363684771033014|      939|
|   517|           2.38625|      400|
|   308|2.4260869565217393|      115|
|     3|2.4358974358974357|       39|
|   255|2.5681818181818183|       44|
|    22|2.5714285714285716|      119|
|   571|2.5714285714285716|      112|
|   297|2.5972222222222223|       72|
|    19| 2.607396870554765|      703|
|   294| 2.610983981693364|      437|
|   293| 2.619047619047619|       21|
|   287|2.6217105263157894|      152|
|    36|2.6333333333333333|       60|
|   333|              2.64|       25|
+------+------------------+---------+
only showing top 20 rows

Valoraciones promedio por año#

reg_exp = "(\d{4})" # Expresión regular para las fechas (4 dígitos)
movies_data_year = (movies_data
                    .withColumn("movie year",fct.regexp_extract("title",reg_exp,0)) # Generar una nueva columna con el año a partir del título
                    .select("movieId","title", "movie year"))
movies_data_year = (movies_data_year
                    .where(movies_data_year["movie year"]>1950) # Dados errores en los datos, seleccionar solo las
                    .where(movies_data_year["movie year"]<2023)# películas entre 1950 y 2023
                    .orderBy(movies_data_year["movie year"],ascending=True))
movies_data_year.show()
+-------+--------------------+----------+
|movieId|               title|movie year|
+-------+--------------------+----------+
|   4813|When Worlds Colli...|      1951|
|   1032|Alice in Wonderla...|      1951|
|   5168|Royal Wedding (1951)|      1951|
|    900|American in Paris...|      1951|
|   5603|Lavender Hill Mob...|      1951|
|   1104|Streetcar Named D...|      1951|
|   5604|Man in the White ...|      1951|
|   2186|Strangers on a Tr...|      1951|
|   6181|Red Badge of Cour...|      1951|
|   3315|Happy Go Lovely (...|      1951|
|   7211|People Will Talk ...|      1951|
|   3475|Place in the Sun,...|      1951|
|   7301|Diary of a Countr...|      1951|
|  34482|Browning Version,...|      1951|
|   8492|Christmas Carol, ...|      1951|
|   1253|Day the Earth Sto...|      1951|
|   8502|    Show Boat (1951)|      1951|
|   3406|Captain Horatio H...|      1951|
|  33781|    Quo Vadis (1951)|      1951|
|    969|African Queen, Th...|      1951|
+-------+--------------------+----------+
only showing top 20 rows
rating_by_year = (rating_by_movie
                  .join(movies_data_year, ["title"], "inner")
                  .groupby("movie year")
                  .agg(fct.count("movie year").alias("# movies"), fct.avg("avg rating").alias("avg rating"), fct.sum("# reviews").alias("# reviews")))
rating_by_year.show(5)
+----------+--------+------------------+---------+
|movie year|# movies|        avg rating|# reviews|
+----------+--------+------------------+---------+
|      1953|      29| 3.586757381017263|      153|
|      1957|      33|3.7676802486013017|      215|
|      1987|     152| 3.047063165252584|     1538|
|      1956|      30|3.2780932030932033|      115|
|      2016|     218|3.1037478772713865|      785|
+----------+--------+------------------+---------+
only showing top 5 rows
(rating_by_year
 .where(rating_by_year["# reviews"]>100)
 .orderBy(["avg rating"], ascending = False)
 .show(5))
+----------+--------+------------------+---------+
|movie year|# movies|        avg rating|# reviews|
+----------+--------+------------------+---------+
|      1971|      46|3.8117978621974786|      526|
|      1957|      33|3.7676802486013017|      215|
|      1951|      21| 3.718043884220355|      182|
|      1955|      36|  3.67286314229249|      182|
|      1958|      31|3.6680784638090143|      165|
+----------+--------+------------------+---------+
only showing top 5 rows
(rating_by_year
 .orderBy(["# movies"], ascending = False)
 .show(5))
+----------+--------+------------------+---------+
|movie year|# movies|        avg rating|# reviews|
+----------+--------+------------------+---------+
|      2002|     312| 3.206828593714281|     3657|
|      2006|     296| 3.272408199307826|     2585|
|      2001|     295| 3.249999115312597|     4018|
|      2000|     290| 3.225102701405015|     4308|
|      2007|     283|3.3251327591311375|     2293|
+----------+--------+------------------+---------+
only showing top 5 rows

Mejor película por año#

best_movie_by_year = (rating_by_movie
                      .join(movies_data_year, ["title"], "inner")
                      .where(rating_by_movie["# reviews"] > 10)
                      .orderBy(["avg rating"], ascending=False)
                      .groupby("movie year")
                      .agg(fct.first("title").alias("title"), fct.first("avg rating").alias("avg rating")))
best_movie_by_year.orderBy(["movie year"], ascending=False).show()
+----------+--------------------+------------------+
|movie year|               title|        avg rating|
+----------+--------------------+------------------+
|      2018|Avengers: Infinit...|               4.0|
|      2017|        Logan (2017)|              4.28|
|      2016|      Arrival (2016)| 3.980769230769231|
|      2015|    Spotlight (2015)| 4.157894736842105|
|      2014| Nightcrawler (2014)| 4.166666666666667|
|      2013|    Prisoners (2013)|           4.15625|
|      2012|Zero Dark Thirty ...| 4.107142857142857|
|      2011| Intouchables (2011)| 4.108108108108108|
|      2010|   Inside Job (2010)| 4.291666666666667|
|      2009|Inglourious Baste...| 4.136363636363637|
|      2008|Dark Knight, The ...| 4.238255033557047|
|      2007|   Persepolis (2007)| 4.181818181818182|
|      2006|Departed, The (2006)| 4.252336448598131|
|      2005|Cinderella Man (2...| 4.088235294117647|
|      2004|Eternal Sunshine ...|4.1603053435114505|
|      2003|Fog of War: Eleve...|4.3076923076923075|
|      2002|City of God (Cida...|4.1466666666666665|
|      2001|No Man's Land (2001)|4.1923076923076925|
|      2000|Boondock Saints, ...|  4.22093023255814|
|      1999|   Fight Club (1999)| 4.272935779816514|
+----------+--------------------+------------------+
only showing top 20 rows
best_movie_by_year.orderBy(["movie year"]).show()
+----------+--------------------+------------------+
|movie year|               title|        avg rating|
+----------+--------------------+------------------+
|      1951|Streetcar Named D...|             4.475|
|      1952|    High Noon (1952)|4.2105263157894735|
|      1953|Roman Holiday (1953)|4.0576923076923075|
|      1954|  Rear Window (1954)| 4.261904761904762|
|      1955|To Catch a Thief ...| 4.217391304347826|
|      1956|King and I, The (...| 4.166666666666667|
|      1957|Paths of Glory (1...| 4.541666666666667|
|      1958|Touch of Evil (1958)| 4.264705882352941|
|      1959|North by Northwes...| 4.184210526315789|
|      1960|       Psycho (1960)| 4.036144578313253|
|      1961| Hustler, The (1961)| 4.333333333333333|
|      1962|Lawrence of Arabi...|               4.3|
|      1963|Great Escape, The...| 4.127906976744186|
|      1964|Dr. Strangelove o...| 4.268041237113402|
|      1965|For a Few Dollars...| 4.151515151515151|
|      1966|Good, the Bad and...| 4.145833333333333|
|      1967|Guess Who's Comin...| 4.545454545454546|
|      1968|Once Upon a Time ...| 4.305555555555555|
|      1969|Midnight Cowboy (...|3.9545454545454546|
|      1970|Little Big Man (1...| 4.145833333333333|
+----------+--------------------+------------------+
only showing top 20 rows

Mejor película para cada usuario#

best_movie_by_user = (ratings_data
                      .join(movies_data, ["movieId"], "inner")
                      .orderBy(["rating"], ascending=False)
                      .groupby("userId")
                      .agg(fct.first("title").alias("title"), fct.first("rating").alias("rating")))
best_movie_by_user.show() # Con la precisión de que obtenemos solo un resultado por usuario cuando puede haber varias películas con valoración 5.0
+------+--------------------+------+
|userId|               title|rating|
+------+--------------------+------+
|     1|Seven (a.k.a. Se7...|   5.0|
|     2|Step Brothers (2008)|   5.0|
|     3|Escape from L.A. ...|   5.0|
|     4|Flirting With Dis...|   5.0|
|     5|Postman, The (Pos...|   5.0|
|     6|Grumpier Old Men ...|   5.0|
|     7|Star Wars: Episod...|   5.0|
|     8|         Babe (1995)|   5.0|
|     9| Citizen Kane (1941)|   5.0|
|    10|         Troy (2004)|   5.0|
|    11|         Heat (1995)|   5.0|
|    12| First Knight (1995)|   5.0|
|    13|Seven (a.k.a. Se7...|   5.0|
|    14|Seven (a.k.a. Se7...|   5.0|
|    15|Star Wars: Episod...|   5.0|
|    16|       Brazil (1985)|   5.0|
|    17|Star Wars: Episod...|   5.0|
|    18|Usual Suspects, T...|   5.0|
|    19|    Crow, The (1994)|   5.0|
|    20|   Pocahontas (1995)|   5.0|
+------+--------------------+------+
only showing top 20 rows
(ratings_data
 .join(movies_data_year, ["movieId"], "inner")
 .where("userId = 1 AND rating = 5.0")
 .select(["title", "movie year"])
 .show())
+--------------------+----------+
|               title|movie year|
+--------------------+----------+
|Seven (a.k.a. Se7...|      1995|
|Usual Suspects, T...|      1995|
|Bottle Rocket (1996)|      1996|
|      Rob Roy (1995)|      1995|
|Canadian Bacon (1...|      1995|
|    Desperado (1995)|      1995|
|Billy Madison (1995)|      1995|
|Dumb & Dumber (Du...|      1994|
|Star Wars: Episod...|      1977|
|    Tommy Boy (1995)|      1995|
|Jungle Book, The ...|      1994|
|Fugitive, The (1993)|      1993|
|Schindler's List ...|      1993|
|    Tombstone (1993)|      1993|
|        Fargo (1996)|      1996|
|James and the Gia...|      1996|
|Winnie the Pooh a...|      1968|
|Sword in the Ston...|      1963|
|Bedknobs and Broo...|      1971|
|Alice in Wonderla...|      1951|
+--------------------+----------+
only showing top 20 rows

Ejercicios#

¿Cuál es la peor película por cada año?#

worst_movie_by_year = (rating_by_movie
                      .join(movies_data_year, ["title"], "inner")
                      .where(rating_by_movie["# reviews"] > 10)
                      .orderBy(["avg rating"], ascending=True)
                      .groupby("movie year")
                      .agg(fct.first("title").alias("title"), fct.first("avg rating").alias("avg rating")))
worst_movie_by_year.orderBy(["movie year"], ascending=False).show()
+----------+--------------------+------------------+
|movie year|               title|        avg rating|
+----------+--------------------+------------------+
|      2018|   Deadpool 2 (2018)|             3.875|
|      2017|Star Wars: The La...|             3.125|
|      2016|Batman v Superman...|           2.34375|
|      2015|      Spectre (2015)|3.0588235294117645|
|      2014|     Godzilla (2014)|2.6153846153846154|
|      2013|  World War Z (2013)| 3.026315789473684|
|      2012|         2012 (2009)| 2.619047619047619|
|      2011|Transformers: Dar...|2.3636363636363638|
|      2010|Clash of the Tita...|2.3076923076923075|
|      2009|Transformers: Rev...|             2.425|
|      2008|     Twilight (2008)| 2.409090909090909|
|      2007|  Ghost Rider (2007)|              2.25|
|      2006|  Ultraviolet (2006)|1.9230769230769231|
|      2005|Dukes of Hazzard,...|2.0833333333333335|
|      2004|Stepford Wives, T...|              2.25|
|      2003|Dumb and Dumberer...|1.9545454545454546|
|      2002|The Scorpion King...|2.2954545454545454|
|      2001|  Animal, The (2001)|2.1363636363636362|
|      2000|Flintstones in Vi...|             1.625|
|      1999|Inspector Gadget ...|2.0952380952380953|
+----------+--------------------+------------------+
only showing top 20 rows

¿Cuál es el género favorito de cada usuario?#

Género más frecuente de las películas que el usuario valoró con 5.0.

Pistas

  • Los géneros de las películas están en la tabla movies_data_genre
  • Para obtener un conteo de cada género por cada usuario se pueden agrupar las dos variables .groupby(["userId", "genre"])
  • Después se puede usar la secuencia de pasos en los ejemplos anteriores con .fct.first("genre")

genre_by_user = (ratings_data
                      .join(movies_data_genre,["movieId"],"inner") #inner intercepta las dos tablas, devuelve los registros con correspondencia en las dos tablas
                      .where("rating = 5.0")
                      .groupby(["userId", "genre"])
                      .count()
                      .groupby(["userId"])
                      .agg(fct.first("genre").alias("Genero"),fct.first("count").alias("count"))) #fct.first nos deja el primer valor
genre_by_user.show()
+------+---------+-----+
|userId|   Genero|count|
+------+---------+-----+
|     1|    Drama|   42|
|     2| Thriller|    1|
|     3|   Comedy|    1|
|     4|Animation|    2|
|     5|  Musical|    3|
|     6| Children|    8|
|     7|Adventure|    5|
|     8|Animation|    1|
|     9|   Sci-Fi|    1|
|    10|   Action|    6|
|    11|   Horror|    1|
|    12|   Action|    1|
|    13|    Crime|    1|
|    14|  Romance|    2|
|    15|   Action|   10|
|    16|  Fantasy|    1|
|    17| Thriller|    4|
|    18|  Western|    2|
|    19|Film-Noir|    1|
|    20|Adventure|   17|
+------+---------+-----+
only showing top 20 rows

¿Cuál es el género menos preferido por el usuario?#

Género más frecuente de las 5 películas a las que el usuario les dio una valoración menor a 2.

genre_by_user = (ratings_data
                      .join(movies_data_genre,["movieId"],"inner") #inner intercepta las dos tablas, devuelve los registros con correspondencia en las dos tablas
                      .where("rating = 0.5")
                      .groupby(["userId", "genre"])
                      .count()
                      .groupby(["userId"])
                      .agg(fct.first("genre").alias("Genero"),fct.first("count").alias("count"))) #fct.first nos deja el primer valor
genre_by_user.show()
+------+---------+-----+
|userId|   Genero|count|
+------+---------+-----+
|     3|    Crime|    2|
|     7|Adventure|    2|
|    10|   Action|    2|
|    18|   Comedy|    1|
|    20|Adventure|    7|
|    21|  Fantasy|    3|
|    22| Thriller|    6|
|    28|  Romance|    1|
|    34| Thriller|    1|
|    36|   Comedy|    1|
|    41|     IMAX|    2|
|    47|   Comedy|    2|
|    50|Adventure|    1|
|    51|  Western|    1|
|    55|  Fantasy|    2|
|    63|     IMAX|    1|
|    64|    Drama|    1|
|    66|   Horror|    1|
|    68|   Sci-Fi|    1|
|    75|Adventure|    1|
+------+---------+-----+
only showing top 20 rows