Class work - MapReduce

https://drive.google.com/file/d/16QktQoLSM81WJsjADSDnc-QPIMROBFhb/view?usp=drive_link
https://drive.google.com/file/d/1VGRQAvytiDTQT2GALQVu81C9u99Cj0ju/view?usp=drive_link

pip install mrjob
Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl.metadata (7.3 kB)
Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.12/dist-packages (from mrjob) (6.0.2)
Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 439.6/439.6 kB 6.9 MB/s eta 0:00:00
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4

Exercise 1: Text Statistics

Complete the TextStats class by implementing a map-reduce program that, given an input text, counts:

  • The number of characters

  • The number of lines

  • The number of words

and emits three tuples as the result, as in the following example.

Example run

Ancient influences have helped spawn variant interpretations
of the nature of history which have evolved over the centuries
and continue to change today. The modern study of history is
wide-ranging, and includes the study of specific regions and
the study of certain topical or thematical elements of
historical investigation. Often history is taught as part of
primary and secondary education, and the academic study of
history is a major discipline in University studies

Result

"chars"	472
"lines"	8
"words"	73

Note: Given a string s, use s.split() to obtain a list of the substrings of s that are separated by whitespace, or use the regular-expression search seen in class. Use the len function to get the length of a string or of a list.
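As a small illustration of the note above, the sketch below tokenizes a string with split() and with a regular expression, then measures lengths with len. The sample string and the pattern r"\S+" are illustrative choices, not part of the exercise.

```python
import re

s = "The modern study of history is wide-ranging, and includes"

# split() with no arguments splits on any run of whitespace
tokens = s.split()

# Alternative: a regular expression matching runs of non-whitespace characters
# (the pattern is an assumption; the one seen in class may differ)
regex_tokens = re.findall(r"\S+", s)

num_words = len(tokens)  # length of a list
num_chars = len(s)       # length of a string

print(tokens == regex_tokens, num_words, num_chars)
```

For this string both approaches return the same tokens; they can differ once the pattern is changed, for example to drop punctuation.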

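%%writefile /content/TextStats.py
from mrjob.job import MRJob

class TextStats(MRJob):

    def mapper(self, _, line):
        # Each input line contributes its word count, its character count and one line.
        # Keys: "Palabras" (words), "Caracter" (characters), "Lineas" (lines)
        tokens = line.split()

        yield "Palabras", len(tokens)
        # The line arrives without its trailing newline, so newlines are not counted
        yield "Caracter", len(line)
        yield "Lineas", 1


    def reducer(self, key, values):
        # Add up the partial counts emitted for each key
        yield key, sum(values)

if __name__ == '__main__':
    TextStats.run()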
Overwriting /content/TextStats.py
%%script python TextStats.py --quiet
Ancient influences have helped spawn variant interpretations
of the nature of history which have evolved over the centuries
and continue to change today. The modern study of history is
wide-ranging, and includes the study of specific regions and
the study of certain topical or thematical elements of
historical investigation. Often history is taught as part of
primary and secondary education, and the academic study of
history is a major discipline in University studies
"Lineas"	8
"Palabras"	73
"Caracter"	465

Exercise 2: Average number of words per line

Complete the LineAverage class so that the final result is a single key-value pair holding the average number of words per line in the input text.

For the same text as before, the output should be:

"avg"	9.125
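%%writefile LineAverage.py
from mrjob.job import MRJob

class LineAverage(MRJob):

    def mapper(self, _, line):
        # Emit the word count of each line under a single shared key
        tokens = line.split()
        yield "avg", len(tokens)


    def reducer(self, key, values):
        # values can only be iterated once: accumulate the sum and the count together
        total = 0
        count = 0
        for n in values:
            total += n
            count += 1
        yield key, total / count

if __name__ == '__main__':
    LineAverage.run()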
Writing LineAverage.py

Verify that the output is as expected and try new values. Remove the --quiet option to see the error messages from your code if it does not produce the expected output.

%%script python LineAverage.py -q
Ancient influences have helped spawn variant interpretations
of the nature of history which have evolved over the centuries
and continue to change today. The modern study of history is
wide-ranging, and includes the study of specific regions and
the study of certain topical or thematical elements of
historical investigation. Often history is taught as part of
primary and secondary education, and the academic study of
history is a major discipline in University studies
"avg"	9.125

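The computation behind this exercise can be sketched in plain Python, without mrjob. The function names, the sample lines and the dictionary-based shuffle below are assumptions made for illustration; they only mimic the map, shuffle and reduce phases and are not how the framework is implemented.

```python
def mapper(line):
    # Emit one (key, value) pair per line: the number of words it contains
    yield "avg", len(line.split())


def reducer(key, values):
    # Accumulate the sum and the count in a single pass over the values
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
    yield key, total / count


# Hypothetical sample input: 4, 5 and 3 words per line
lines = [
    "the car is nice",
    "that car is mine today",
    "a red shirt",
]

# Simulated shuffle: collect every value emitted under the same key
grouped = {}
for line in lines:
    for key, value in mapper(line):
        grouped.setdefault(key, []).append(value)

results = [pair for key, values in grouped.items() for pair in reducer(key, values)]
print(results)
```

Because every line is emitted under the same key, a single reducer call sees all the word counts and can divide their sum by the number of lines.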
Exercise 3: Inverted index

Complete the InvertedIndex class by implementing a map-reduce program that builds an inverted index for a list of documents. That is, for each word in our document collection, the inverted index tells us which documents it appears in. These indexes are later used to answer queries.
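To illustrate how such an index can answer a query, here is a minimal sketch that returns the documents containing every word of the query. The index literal copies the result of this exercise's example; the search helper and its all-words-must-match semantics are hypothetical choices, not something the exercise prescribes.

```python
# Inverted index taken from the example result of this exercise
index = {
    "best": ["004"],
    "car": ["001", "002", "004"],
    "is": ["001", "002", "003", "004"],
    "mine": ["002"],
    "nice": ["001", "003"],
    "shirt": ["003"],
    "that": ["002", "003"],
    "the": ["001", "004"],
}


def search(index, query):
    """Return the sorted ids of the documents that contain every query word."""
    # One set of document ids per query word; unknown words give an empty set
    postings = [set(index.get(word, [])) for word in query.split()]
    if not postings:
        return []
    # A document matches only if it appears in every set
    return sorted(set.intersection(*postings))


print(search(index, "car is nice"))
print(search(index, "that car"))
```

Each lookup touches only the lists of the query words, which is why the index is built ahead of time instead of scanning every document per query.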

Example run (the input is in JSON format)

["001", "the car is nice"]
["002", "that car is mine"]
["003", "that shirt is nice"]  
["004", "the car is the best"]

Result

"best"  ["004"]
"car"   ["001", "002", "004"]
"is"    ["001", "002", "003", "004"]
"mine"  ["002"]
"nice"  ["001", "003"]
"shirt" ["003"]
"that"  ["002", "003"]
"the"   ["001", "004"]

Note: use the line:

key, text = json.loads(line)

to extract the document id and its text from the JSON format.
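The idea can be sketched in plain Python before writing the mrjob class. This is only an illustration under stated assumptions: the variable names and the use of a dictionary of sets are choices made here, and the sets ensure that a word repeated within one document contributes that document's id only once.

```python
import json
from collections import defaultdict

# Input lines from the example of this exercise
lines = [
    '["001", "the car is nice"]',
    '["002", "that car is mine"]',
    '["003", "that shirt is nice"]',
    '["004", "the car is the best"]',
]

# word -> set of ids of the documents that contain it
index = defaultdict(set)
for line in lines:
    doc_id, text = json.loads(line)  # unpack the JSON pair [id, text]
    for word in text.split():
        index[word].add(doc_id)

# Sort words and ids so the result is easy to compare with the example
inverted = {word: sorted(ids) for word, ids in sorted(index.items())}
for word, ids in inverted.items():
    print(word, ids)
```

In the map-reduce version the loop over words plays the role of the mapper, and gathering the ids of each word plays the role of the reducer.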

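%%writefile InvertedIndex.py
from mrjob.job import MRJob
import json

class InvertedIndex(MRJob):

    def mapper(self, _, line):
        # Each input line is a JSON pair [document id, text]
        Ind, text = json.loads(line)
        # One possible solution: emit (word, document id) for every occurrence.
        # Iterating over set(text.split()) instead would emit each id once per document.
        for word in text.split():
            yield word, Ind


    def reducer(self, key, values):
        # One possible solution: collect the ids of the documents containing the word
        yield key, list(values)

if __name__ == '__main__':
    InvertedIndex.run()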
Writing InvertedIndex.py

Verify that the output is as expected and try new values.

%%script python InvertedIndex.py -q
["001", "the car is nice"]
["002", "that car is mine"]
["003", "that shirt is nice"]
["004", "the car is the best"]
"mine"	["002"]
"nice"	["001", "003"]
"shirt"	["003"]
"that"	["002", "003"]
"the"	["001", "004", "004"]
"best"	["004"]
"car"	["001", "002", "004"]
"is"	["001", "002", "003", "004"]
%%script python InvertedIndex.py -q
["001", "War is a state of armed conflict between autonomous organizations (such as states and non-state actors) or coalitions of such organizations. It is generally characterized by extreme collective aggression, destruction, and usually high mortality. The set of techniques used by a group to carry out war is known as warfare. An absence of war is usually called peace."]
["032", "War must entail some degree of confrontation using weapons and other military technology and equipment by armed forces employing military tactics and operational art within a broad military strategy subject to military logistics."]
["105", "While some scholars see warfare as a universal and ancestral aspect of human nature, others argue that it is only a result of specific socio-cultural or ecological circumstances"]
"KEY"	"VALUE"

Exercise 4: Matrix Sum

We represent two 2x2 matrices a and b as follows:

["a", 0, 0, 32]
["a", 0, 1, 69]
["a", 1, 0, 18]
["a", 1, 1, 28]
["b", 0, 0, 18]
["b", 0, 1, 69]
["b", 1, 0, 28]
["b", 1, 1, 32]

that is

a = [32 69]     b = [18 69]    a + b = [50 138]
    [18 28]         [28 32]            [46  60]

You have to implement a map-reduce program that accepts the two matrices as input and computes their sum.
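A minimal plain-Python sketch of the approach, assuming each entry is keyed by its (row, column) position so that the matching entries of a and b are added together. The variable names are illustrative, and the fixed 2x2 size is taken from the example above.

```python
import json
from collections import defaultdict

# Input lines from the example of this exercise
lines = [
    '["a", 0, 0, 32]',
    '["a", 0, 1, 69]',
    '["a", 1, 0, 18]',
    '["a", 1, 1, 28]',
    '["b", 0, 0, 18]',
    '["b", 0, 1, 69]',
    '["b", 1, 0, 28]',
    '["b", 1, 1, 32]',
]

# Map: key each entry by its position; reduce: add the values sharing a key
sums = defaultdict(int)
for line in lines:
    name, row, col, value = json.loads(line)
    sums[(row, col)] += value

# Rebuild the 2x2 result from the per-position sums
total = [[sums[(i, j)] for j in range(2)] for i in range(2)]
print(total)
```

The matrix name is not needed for the sum: only the position decides which values are added together.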

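%%writefile MatrixSum.py
from mrjob.job import MRJob
import json

class MatrixSum(MRJob):

    def mapper(self, _, line):
        # Each input line is a JSON list [matrix name, row, column, value]
        name, row, col, value = json.loads(line)
        # One possible solution: key by position so that the entries of a and b
        # sharing (row, column) reach the same reducer call
        yield (row, col), value


    def reducer(self, key, values):
        # Add the entries that share the position
        yield key, sum(values)

if __name__ == '__main__':
    MatrixSum.run()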
Writing MatrixSum.py
%%script python MatrixSum.py -q
["a", 0, 0, 32]
["a", 0, 1, 69]
["a", 1, 0, 18]
["a", 1, 1, 28]
["b", 0, 0, 18]
["b", 0, 1, 69]
["b", 1, 0, 28]
["b", 1, 1, 32]
"KEY"	"VALUE"