Machine Translation Exercises#
In these exercises you will develop a machine translation system that can turn modern English into Shakespeare.
Setup 1: Load Libraries#
%load_ext autoreload
%autoreload 2
%matplotlib inline
import sys, os
_snlp_book_dir = ".."
sys.path.append(_snlp_book_dir)
import statnlpbook.word_mt as word_mt
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
from collections import defaultdict
import statnlpbook.util as util
from statnlpbook.lm import *
from statnlpbook.util import safe_log as log
import statnlpbook.mt as mt
$$ \newcommand{\Xs}{\mathcal{X}} \newcommand{\Ys}{\mathcal{Y}} \newcommand{\y}{\mathbf{y}} \newcommand{\balpha}{\boldsymbol{\alpha}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\aligns}{\mathbf{a}} \newcommand{\align}{a} \newcommand{\source}{\mathbf{s}} \newcommand{\target}{\mathbf{t}} \newcommand{\ssource}{s} \newcommand{\starget}{t} \newcommand{\repr}{\mathbf{f}} \newcommand{\repry}{\mathbf{g}} \newcommand{\x}{\mathbf{x}} \newcommand{\prob}{p} \newcommand{\vocab}{V} \newcommand{\params}{\boldsymbol{\theta}} \newcommand{\param}{\theta} \DeclareMathOperator{\perplexity}{PP} \DeclareMathOperator{\argmax}{argmax} \DeclareMathOperator{\argmin}{argmin} \newcommand{\train}{\mathcal{D}} \newcommand{\counts}[2]{\#_{#1}(#2)} \newcommand{\length}[1]{\text{length}(#1)} \newcommand{\indi}{\mathbb{I}} $$
Setup 2: Download Data#
%%sh
cd ../data
if [ ! -d "shakespeare" ]; then
git clone https://github.com/tokestermw/tensorflow-shakespeare.git shakespeare
cd shakespeare
cat ./data/shakespeare/sparknotes/merged/*_modern.snt.aligned > modern.txt
cat ./data/shakespeare/sparknotes/merged/*_original.snt.aligned > original.txt
cd ..
fi
head -n 1 shakespeare/modern.txt
head -n 1 shakespeare/original.txt
I have half a mind to hit you before you speak again.
I have a mind to strike thee ere thou speak’st.
Task 1: Preprocessing Aligned Corpus#
Write methods for loading and tokenizing the aligned corpus.
import re
NULL = "NULL"
def tokenize(sentence):
return [] # todo
def pre_process(sentence):
return [] # todo
def load_shakespeare(corpus):
with open("../data/shakespeare/%s.txt" % corpus, "r") as f:
return [pre_process(x.rstrip('\n')) for x in f.readlines()]
modern = load_shakespeare("modern")
original = load_shakespeare("original")
MAX_LENGTH = 6
def create_wordmt_pairs(modern, original):
alignments = []
for i in range(len(modern)):
if len(modern[i]) <= MAX_LENGTH and len(original[i]) <= MAX_LENGTH:
alignments.append(([NULL] + modern[i], original[i]))
return alignments
train = create_wordmt_pairs(modern, original)
for i in range(10):
(mod, org) = train[i]
print(" ".join(mod), "|", " ".join(org))
print("\nTotal number of aligned sentence pairs", len(train))
NULL |
NULL |
NULL |
NULL |
NULL |
NULL |
NULL |
NULL |
NULL |
NULL |
Total number of aligned sentence pairs 21079
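One way to fill in the two stubs above is sketched below. It assumes a simple regex tokenizer and lowercasing as the only normalisation; any other reasonable tokenisation scheme works just as well, and the TOKEN_RE name is only illustrative.
import re
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # runs of word characters, or single punctuation marks
def tokenize(sentence):
    return TOKEN_RE.findall(sentence)
def pre_process(sentence):
    # lowercase every token so counts are not split across casings
    return [token.lower() for token in tokenize(sentence)]
With an implementation like this, the printout above shows actual modern/original token pairs instead of bare NULL markers, and the MAX_LENGTH filter reduces the number of training pairs.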
Task 2: Train IBM Model 2#
Train an IBM Model 2 that translates modern English to Shakespeare
Visualize alignments of the sentence pairs before and after training using EM
Do you find interesting cases?
What are likely words that “killed” can be translated to?
Test your translation system using a beam-search decoder
How does the beam size change the quality of the translation?
Give examples of good and bad translations
# todo
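A minimal sketch of EM training for IBM Model 2 on the train pairs built above is given below. It directly estimates translation probabilities trans[modern word][original word] and position-based distortion probabilities, assuming the preprocessing from Task 1 is in place; the train_ibm2 helper and its defaults are illustrative and not part of statnlpbook.
from collections import defaultdict

def train_ibm2(pairs, iterations=10):
    # trans[s_w][t_w] = t(target word | source word), dist[(j, l_s, l_t)][i] = q(i | j, l_s, l_t)
    trans = defaultdict(lambda: defaultdict(float))
    dist = defaultdict(lambda: defaultdict(float))
    # initialise uniformly over the word and position pairs that co-occur
    for src, tgt in pairs:
        l_s, l_t = len(src), len(tgt)
        for j, t_w in enumerate(tgt):
            for i, s_w in enumerate(src):
                trans[s_w][t_w] = 1.0
                dist[(j, l_s, l_t)][i] = 1.0 / l_s
    for s_w, row in trans.items():
        for t_w in row:
            row[t_w] = 1.0 / len(row)
    for _ in range(iterations):
        count_t = defaultdict(lambda: defaultdict(float))
        count_d = defaultdict(lambda: defaultdict(float))
        # E-step: softly align every target word to the source positions
        for src, tgt in pairs:
            l_s, l_t = len(src), len(tgt)
            for j, t_w in enumerate(tgt):
                norm = sum(dist[(j, l_s, l_t)][i] * trans[s_w][t_w]
                           for i, s_w in enumerate(src))
                if norm == 0.0:
                    continue
                for i, s_w in enumerate(src):
                    delta = dist[(j, l_s, l_t)][i] * trans[s_w][t_w] / norm
                    count_t[s_w][t_w] += delta
                    count_d[(j, l_s, l_t)][i] += delta
        # M-step: renormalise the expected counts into probabilities
        for s_w, row in count_t.items():
            total = sum(row.values())
            for t_w, c in row.items():
                trans[s_w][t_w] = c / total
        for key, row in count_d.items():
            total = sum(row.values())
            for i, c in row.items():
                dist[key][i] = c / total
    return trans, dist

# trans, dist = train_ibm2(train, iterations=5)
# sorted(trans["killed"].items(), key=lambda pair: -pair[1])[:5]
A beam-search decoder can then score partial target hypotheses by combining these two tables with a language model over the original side, keeping only the beam-size best hypotheses at each step; varying that beam size is what the questions above ask you to explore.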
Task 3: Better Language Model#
Try a better language model for machine translation. How does the translation quality change for the examples you found earlier?
# todo
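One option is a Laplace-smoothed bigram model estimated on the original (Shakespeare) side of the corpus. The BigramLM class below is a self-contained sketch rather than the statnlpbook LM interface; plug its probabilities into the language-model part of your decoder score and compare the outputs on the examples you collected in Task 2.
from collections import defaultdict

class BigramLM:
    # Laplace-smoothed bigram language model over tokenised sentences
    def __init__(self, sentences, alpha=0.1):
        self.alpha = alpha
        self.context_counts = defaultdict(float)
        self.bigram_counts = defaultdict(float)
        self.vocab = set()
        for sent in sentences:
            prev = "<s>"
            for word in sent + ["</s>"]:
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1
                self.vocab.add(word)
                prev = word

    def prob(self, word, prev):
        # add-alpha estimate of p(word | prev)
        return ((self.bigram_counts[(prev, word)] + self.alpha) /
                (self.context_counts[prev] + self.alpha * len(self.vocab)))

# lm = BigramLM(original)
# lm.prob("thee", "strike")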
Task 4: Better Decoding#
How can you change the decoder so that it can translate into target sequences that are shorter or longer than the source?
# todo
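One common answer is to score the hypothesis length explicitly, so the beam search is not forced to produce exactly one target word per source word. The sketch below adds a Poisson length model to the hypothesis score; the function names and the ratio hyperparameter are purely illustrative.
import math

def length_log_prob(target_len, source_len, ratio=1.1):
    # Poisson length model whose mean is proportional to the source length
    mean = ratio * source_len
    return target_len * math.log(mean) - mean - math.lgamma(target_len + 1)

def hypothesis_score(tm_log_prob, lm_log_prob, target_len, source_len):
    # translation model + language model + length model, all in log space
    return tm_log_prob + lm_log_prob + length_log_prob(target_len, source_len)
With such a term in the score, the decoder can be allowed to insert target words aligned to NULL or to stop before every source word has produced a translation, because the length model keeps the overall target length in a sensible range rather than tying it to the source length.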