# Unnecessary Complexity Through Layer Proliferation

In this blog, we continue to elaborate on our loosely `transformer`-based approach to training our computer to “understand” elementary symbolic arithmetic. Just as previously mentioned, a quick and fun intro to the venerable transformer is here.

Our objective is to expand our layers’ repertoire to include the `encoder` and the `decoder` stacks, and ultimately to arrive at a model representable by the graph and the runnable example.

We need to prep our environment to run any meaningful code:

``````
import numpy as np
import tensorflow as tf
import dataset as qd
import ragged as qr
ks = tf.keras
kl = ks.layers
``````

Before we start to focus on our stacks, an important feature of encoding textual inputs needs to be considered. To aid in making sense of a text, we need to include not just the information carried by the tokens themselves but also their position in the input sequence.

Following the “positional encoding” approach from here, we can define our `pos_timing` function as follows.

The quick graphical plot helps us confirm the correctness of our code as the concatenated `sin` and `cos` timing signals give us finely graded and rich positional embeddings:

``````
def pos_timing(width, depth):
    # Sinusoidal timing signal of shape (1, width, depth).
    assert depth % 2 == 0
    d = np.arange(depth)[np.newaxis, :]
    # Columns 2i and 2i+1 share the same frequency.
    d = 1 / np.power(10000, (2 * (d // 2)) / np.float32(depth))
    t = np.arange(width)[:, np.newaxis] * d
    # Sines of the even columns first, then cosines of the odd columns.
    t = [np.sin(t[:, 0::2]), np.cos(t[:, 1::2])]
    t = np.concatenate(t, axis=-1)[np.newaxis, ...]
    return t

pos = pos_timing(50, 512)

import matplotlib.pyplot as plt
# pcolormesh expects a 2-D array, so drop the leading batch axis.
plt.pcolormesh(pos[0], cmap='RdBu')
plt.xlabel('Depth')
plt.xlim((0, 512))
plt.ylabel('Position')
plt.colorbar()
plt.show()
``````

``````
print(qd.vocab)
``````
``````
(' ', ':', '|', 'x', 'y', '=', ',', '+', '-', '*', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9')
``````
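As a quick numeric sanity check to complement the plot, the sketch below re-declares `pos_timing` so that it runs on its own with only NumPy, then inspects the shape and the first position of the result. The small `width` and `depth` values are chosen purely for illustration and are not the ones used by our model.

```python
import numpy as np

def pos_timing(width, depth):
    # Same computation as in the post, repeated to keep the sketch standalone.
    assert depth % 2 == 0
    d = np.arange(depth)[np.newaxis, :]
    d = 1 / np.power(10000, (2 * (d // 2)) / np.float32(depth))
    t = np.arange(width)[:, np.newaxis] * d
    t = [np.sin(t[:, 0::2]), np.cos(t[:, 1::2])]
    return np.concatenate(t, axis=-1)[np.newaxis, ...]

# Illustrative sizes only: 4 positions, 8 channels.
pos = pos_timing(width=4, depth=8)
print(pos.shape)

# At position 0 every angle is 0, so the sine half is all zeros
# and the cosine half is all ones.
first = pos[0, 0]
print(first)
```

Every later position mixes sines and cosines of different frequencies, which is what makes each row of the signal distinguishable from the others.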

We continue with defining our own shared Keras “base” layer.

At this time, we only store a reference to our parameters instance in it, as all our layers will need to use this resource:

``````
class Layer(kl.Layer):
    def __init__(self, ps, **kw):
        kw.setdefault('dtype', tf.float32)
        super().__init__(**kw)
        # Shared parameters instance, used by all our layers.
        self.ps = ps
``````

## Embedding layer

The `Embed` layer is taken directly from our previous blog.

As we need to add the above-mentioned positional “timing signals” to our embeddings, we first create just such a `constant` tensor.

Then, using the previously mentioned `RaggedTensor` technique of extracting the ragged-shaped values from a dense tensor, we simply add the positional info to the already determined embedding:

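``````
class Embed(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        # Learned embedding table: one row per vocabulary token.
        s = (ps.dim_vocab, ps.dim_hidden)
        self.emb = self.add_weight(name='emb', shape=s)
        # Fixed positional timing signal, one copy per batch row.
        p = pos_timing(ps.len_max_input, ps.dim_hidden)
        p = tf.constant(p, dtype=tf.float32)
        self.pos = tf.broadcast_to(p, [ps.dim_batch] + p.shape[1:])

    def call(self, x):
        # The two inputs are the flat values and the row splits.
        fv, rs = x
        x = tf.RaggedTensor.from_row_splits(fv, rs)
        y = tf.ragged.map_flat_values(tf.nn.embedding_lookup, self.emb, x)
        # Keep only as many positional rows as each sequence has tokens.
        y += tf.RaggedTensor.from_tensor(self.pos, lengths=y.row_lengths())
        return y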
``````
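To make the ragged bookkeeping more concrete, here is a minimal NumPy-only sketch of what the lookup and the positional addition amount to. The token ids, the random `emb` table and the random `pos` signal are made-up stand-ins for illustration; the real layer uses learned weights, the `pos_timing` output and TensorFlow's `RaggedTensor` ops rather than a Python loop.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_vocab, dim_hidden, len_max_input = 20, 6, 20

# Hypothetical stand-ins for the learned table and the timing signal.
emb = rng.normal(size=(dim_vocab, dim_hidden))
pos = rng.normal(size=(len_max_input, dim_hidden))

# Two sequences stored "ragged" style: all tokens in one flat array,
# with row_splits marking where each sequence starts and ends.
flat_values = np.array([3, 5, 10, 7, 11, 4])
row_splits = np.array([0, 4, 6])

rows = []
for start, end in zip(row_splits[:-1], row_splits[1:]):
    tokens = flat_values[start:end]
    # Look up each token, then add the timing rows for positions 0..len-1.
    rows.append(emb[tokens] + pos[:end - start])

print([r.shape for r in rows])
```

Note that each sequence restarts at position 0, which is why the positional tensor is sliced by the row lengths rather than by the flat offsets.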

## Encode and decode stacks

The next layers to write are the `Encode` and `Decode` stacks.

We implement them as simple lists of the respective `Encoder` and `Decoder` components. When calling the stacks, the code simply loops through the component lists and calls the components.

In order to “chain” the stacks, every component is given the output of the previous component as its input.

The `Decoder` components, in addition to their regular inputs, also receive the previously encoded output as their `ctx` argument:

``````
class Encode(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        self.encs = [Encoder(self, f'enc_{i}') for i in range(ps.dim_stacks)]

    def call(self, x):
        y = x
        # Each encoder consumes the output of the previous one.
        for e in self.encs:
            y, ctx = e(y)
        return [y, ctx]

class Decode(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        self.decs = [Decoder(self, f'dec_{i}') for i in range(ps.dim_stacks)]

    def call(self, x):
        y, ctx = x
        # Every decoder sees the same encoded context.
        for d in self.decs:
            y = d([y, ctx])
        return y
``````
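The chaining logic can be sketched without any TensorFlow at all. In the toy below, `make_encoder` and `make_decoder` are hypothetical helpers that return plain functions which merely record that they were called; they stand in for the real `Encoder` and `Decoder` components only to show the data flow.

```python
def make_encoder(i):
    # Toy encoder: appends its name and returns a context label.
    def enc(y):
        return y + [f'enc_{i}'], f'ctx_{i}'
    return enc

def make_decoder(i):
    # Toy decoder: appends its name together with the context it received.
    def dec(args):
        y, ctx = args
        return y + [f'dec_{i}({ctx})']
    return dec

encs = [make_encoder(i) for i in range(2)]
decs = [make_decoder(i) for i in range(2)]

y = ['input']
for e in encs:
    y, ctx = e(y)        # output of one encoder feeds the next
for d in decs:
    y = d([y, ctx])      # all decoders share the last encoder's context

print(y)
# ['input', 'enc_0', 'enc_1', 'dec_0(ctx_1)', 'dec_1(ctx_1)']
```

The trace makes one detail of the loop visible: only the context returned by the last encoder survives, and that same context is handed to every decoder.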

## “Debed” layer

The mirror “image” of the `Embed` layer is our `Debed` layer.

While the embedding step maps `int` tokens to higher-dimensional, learned `float` values, the “debedding” step does the opposite. It takes the higher-dimensional values and maps them to vectors of `dim_vocab` scores (logits) that approximate the `one-hot` encodings of the output tokens.

As debedding is implemented using a `Dense` component, it also requires a fixed width. Just as in the previous blog, we simply pad our ragged tensors with `0`s to our `len_max_input` parameter, as our calculations are complete and the raggedness is not needed any longer:

``````
class Debed(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        self.max_len = u = ps.len_max_input
        # Flattened (position, hidden) input mapped to vocabulary scores.
        s = [u * ps.dim_hidden, ps.dim_vocab]
        self.dbd = Dense(self, 'dbd', s)

    def call(self, x):
        y = x.to_tensor()
        s = tf.shape(y)
        # Zero-pad the sequence axis up to the fixed maximum length.
        y = tf.pad(y, [[0, 0], [0, self.max_len - s[-2]], [0, 0]])
        y = tf.reshape(y, [-1, self.max_len * s[-1]])
        y = self.dbd(y)
        return y
``````
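The shape gymnastics of the pad, flatten and project steps can be traced with a small NumPy sketch. The random `kern` and the zero `bias` are illustrative stand-ins for the learned `Dense` weights, and the batch of two four-token sequences is an assumption made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
len_max_input, dim_hidden, dim_vocab = 20, 6, 20

# Hypothetical stand-ins for the learned 'dbd' weights.
kern = rng.normal(size=(len_max_input * dim_hidden, dim_vocab))
bias = np.zeros(dim_vocab)

# A dense batch: 2 sequences of 4 tokens, each token a 6-d vector.
x = rng.normal(size=(2, 4, dim_hidden))

# Zero-pad the sequence axis to the fixed maximum length ...
y = np.pad(x, [(0, 0), (0, len_max_input - x.shape[1]), (0, 0)])
# ... flatten positions and channels into one wide vector ...
y = y.reshape(-1, len_max_input * dim_hidden)
# ... and project to one score per vocabulary entry.
logits = y @ kern + bias
pred = logits.argmax(axis=-1)

print(y.shape, logits.shape, pred.shape)
```

Each sequence thus ends up with a single vector of vocabulary scores, which is what the sparse categorical cross-entropy loss with `from_logits=True` expects.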

## Lightweight Modules

We have now completed the definition of our top layers as Keras layers, but we still need to define the inner components.

We could continue using the seemingly “heavy” Keras layers and nest them deeper. Instead, as presented in a previous blog, we switch over to using the much lighter-weight `Module` as the base class for our inner components.

Our `Encoder` thus becomes a simple module containing the self-attention followed by the feed-forward mechanisms. We fittingly call the inner modules `reflect` and `conclude`.

Our `Decoder` also adds the attention layer looking at the previously encoded “context”. Hence, the module encapsulating this attention component is called `consider`:

``````
class Encoder(tf.Module):
    def __init__(self, layer, name=None):
        super().__init__(name=name)
        with self.name_scope:
            self.reflect = Attention(layer, 'refl')
            self.conclude = Conclusion(layer, 'conc')

    @tf.Module.with_name_scope
    def __call__(self, x):
        # Self-attention followed by the feed-forward step.
        y, ctx = self.reflect([x, None])
        y = self.conclude(y)
        return [y, ctx]

class Decoder(tf.Module):
    def __init__(self, layer, name=None):
        super().__init__(name=name)
        with self.name_scope:
            self.reflect = Attention(layer, 'refl')
            self.consider = Attention(layer, 'cnsd')
            self.conclude = Conclusion(layer, 'conc')

    @tf.Module.with_name_scope
    def __call__(self, x):
        x, ctx = x
        # Self-attention, then attention given the encoded context.
        y, _ = self.reflect([x, None])
        y, _ = self.consider([y, ctx])
        y = self.conclude(y)
        return y
``````

## Attention Module

Our `Attention` component, again based on `Module`, is taken directly from the previous blog. As explained there, it relies on and takes advantage of the new `RaggedTensor`s:

``````
class Attention(tf.Module):
    def __init__(self, layer, name):
        super().__init__(name=name)
        h = layer.ps.dim_hidden
        self.scale = 1 / (h**0.5)
        with self.name_scope:
            # Weights are created through the enclosing Keras layer.
            self.q = layer.add_weight(name='q', shape=(h, h))
            self.k = layer.add_weight(name='k', shape=(h, h))
            self.v = layer.add_weight(name='v', shape=(h, h))

    @tf.Module.with_name_scope
    def __call__(self, x):
        # ctx is unpacked but not used in this version.
        x, ctx = x
        q = x.with_values(tf.einsum('ni,ij->nj', x.flat_values, self.q))
        k = x.with_values(tf.einsum('ni,ij->nj', x.flat_values, self.k))
        v = x.with_values(tf.einsum('ni,ij->nj', x.flat_values, self.v))
        y = tf.einsum('bsi,bzi->bsz', q.to_tensor(), k.to_tensor())
        y = tf.nn.softmax(y * self.scale)
        y = tf.einsum('bsz,bzi->bsi', y, v.to_tensor())
        y = tf.RaggedTensor.from_tensor(y, lengths=x.row_lengths())
        # The second output is a constant placeholder for the context.
        return [y, tf.constant(1)]
``````
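For readers who want to follow the three `einsum` steps on a dense batch, here is a minimal NumPy sketch of the same scaled dot-product computation. It assumes equal-length sequences (so no ragged handling is needed) and uses random matrices `wq`, `wk` and `wv` as stand-ins for the learned weights; the `softmax` helper is written out by hand for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, seq, hidden = 2, 4, 6

x = rng.normal(size=(batch, seq, hidden))
# Hypothetical stand-ins for the learned q, k and v weights.
wq, wk, wv = (rng.normal(size=(hidden, hidden)) for _ in range(3))

q = np.einsum('bsi,ij->bsj', x, wq)
k = np.einsum('bsi,ij->bsj', x, wk)
v = np.einsum('bsi,ij->bsj', x, wv)

# Similarity of every query position with every key position.
scores = np.einsum('bsi,bzi->bsz', q, k) / np.sqrt(hidden)
weights = softmax(scores)
# Each output position is a weighted average of the value vectors.
y = np.einsum('bsz,bzi->bsi', weights, v)

print(scores.shape, y.shape)
```

The attention weights of every query position sum to one, so the output keeps the `(batch, seq, hidden)` shape of the input while mixing information across positions.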

A new component is our `Conclusion` module. It implements the “feed-forward” functionality of the transformer.

In simple terms, it takes the attention-enhanced, higher-dimensional, element-wise mapping of the token sequence and first `inflates` it to an even higher dimension, applying a non-linearity, or `activation`, at the end as its “concluding” step.

Then it `deflates` the activated mapping back to our hidden dimension, making it available for the next level in the stack.

The same `RaggedTensor` trick, as the one we used in the `Attention` module, applies at the end:

``````
class Conclusion(tf.Module):
    def __init__(self, layer, name):
        super().__init__(name=name)
        ps = layer.ps
        self.max_len = w = ps.len_max_input
        # Width of the flattened (position, hidden) input.
        w *= ps.dim_hidden
        with self.name_scope:
            s = [w, ps.dim_dense]
            self.inflate = Dense(layer, 'infl', s, activation='relu')
            s = [ps.dim_dense, w]
            self.deflate = Dense(layer, 'defl', s, bias=False)

    @tf.Module.with_name_scope
    def __call__(self, x):
        y = x.to_tensor()
        s = tf.shape(y)
        # Pad to the fixed length and flatten for the dense modules.
        y = tf.pad(y, [[0, 0], [0, self.max_len - s[-2]], [0, 0]])
        y = tf.reshape(y, [-1, self.max_len * s[-1]])
        y = self.inflate(y)
        y = self.deflate(y)
        # Restore the (batch, position, hidden) layout and the raggedness.
        y = tf.reshape(y, [-1, self.max_len, s[-1]])
        y = tf.RaggedTensor.from_tensor(y, lengths=x.row_lengths())
        return y
``````
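The inflate and deflate round trip is easiest to follow by watching the shapes. The NumPy sketch below mirrors the steps on a dense batch of two equal-length sequences; the random kernels are illustrative stand-ins for the learned weights, and the final slice plays the role that `from_tensor(..., lengths=...)` plays for ragged inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
len_max_input, dim_hidden, dim_dense = 20, 6, 150
width = len_max_input * dim_hidden

# Hypothetical stand-ins for the 'infl' and 'defl' weights.
infl_kern = rng.normal(size=(width, dim_dense))
infl_bias = np.zeros(dim_dense)
defl_kern = rng.normal(size=(dim_dense, width))   # deflate has no bias

x = rng.normal(size=(2, 4, dim_hidden))

# Pad to the fixed length and flatten to one vector per sequence.
y = np.pad(x, [(0, 0), (0, len_max_input - x.shape[1]), (0, 0)])
y = y.reshape(-1, width)

# Inflate with a ReLU non-linearity, then deflate back.
inflated = np.maximum(y @ infl_kern + infl_bias, 0)
deflated = inflated @ defl_kern

# Restore the per-position layout and drop the padded positions.
out = deflated.reshape(-1, len_max_input, dim_hidden)[:, :x.shape[1]]

print(inflated.shape, deflated.shape, out.shape)
```

Because the sequence is flattened before the dense steps, this feed-forward variant mixes all positions together, unlike a purely position-wise transformation.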

Our last component is the `Dense` module.

It simply re-implements the Keras layer with the same name, yet with more focused, streamlined functionality and minimal configurability.

The interesting aspect of this module, just as with our `Attention` module above, is that the Keras weights it needs are created through the enclosing Keras layer. Topologically, however, they are listed directly as part of their respective modules:

``````
class Dense(tf.Module):
    bias = None
    activation = None

    def __init__(self, layer, name, shape, activation=None, bias=True):
        super().__init__(name=name)
        with self.name_scope:
            # Weights are created through the enclosing Keras layer.
            kw = dict(shape=shape, initializer='glorot_uniform')
            self.kern = layer.add_weight(name='kern', **kw)
            if bias:
                kw.update(shape=shape[1:], initializer='zeros')
                self.bias = layer.add_weight(name='bias', **kw)
            self.activation = ks.activations.get(activation)

    @tf.Module.with_name_scope
    def __call__(self, x):
        y = tf.einsum('bi,ij->bj', x, self.kern)
        if self.bias is not None:
            y = tf.nn.bias_add(y, self.bias)
        if self.activation:
            y = self.activation(y)
        return y
``````

## Training session

And now we are ready to define our model.

The model has two inputs, the flat values and the row splits of our input `RaggedTensor`.

We also use our new `Embed`, `Encode`, `Decode` and `Debed` Keras layers, with all the internal, light-weight modules hidden at this level.

The rest of the model is simply carried over from the previous blog:

``````
def model_for(ps):
    # Flat token values and row splits of the ragged input.
    x = [ks.Input(shape=(), dtype='int32'), ks.Input(shape=(), dtype='int64')]
    y = Embed(ps)(x)
    y = Encode(ps)(y)
    y = Decode(ps)(y)
    y = Debed(ps)(y)
    m = ks.Model(inputs=x, outputs=y)
    m.compile(optimizer=ps.optimizer, loss=ps.loss, metrics=[ps.metric])
    print(m.summary())
    return m
``````

Our parameters need to be adjusted to provide parametric values for our stacks:

``````
params = dict(
    dim_batch=2,
    dim_dense=150,
    dim_hidden=6,
    dim_stacks=2,
    dim_vocab=len(qd.vocab),
    len_max_input=20,
    loss=ks.losses.SparseCategoricalCrossentropy(from_logits=True),
    metric=ks.metrics.SparseCategoricalAccuracy(),
    num_epochs=5,
    num_shards=2,
)
``````

By firing up our training session, we can confirm the model’s layers and connections. The listing of a short session follows.

We can easily adjust the parameters to tailor the length of the sessions to our objectives. However, at this point the results are still largely meaningless and extending the training is not yet warranted.

``````
ps = qd.Params(**params)
qr.main_graph(ps, qr.dset_for(ps), model_for(ps))
``````
``````
Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_5 (InputLayer)            [(None,)]            0
__________________________________________________________________________________________________
input_6 (InputLayer)            [(None,)]            0
__________________________________________________________________________________________________
embed_2 (Embed)                 (None, None, 6)      120         input_5
                                                                 input_6
__________________________________________________________________________________________________
encode_2 (Encode)               [(None, None, None), 72516       embed_2
__________________________________________________________________________________________________
decode_2 (Decode)               (None, None, None)   72732       encode_2
                                                                 encode_2
__________________________________________________________________________________________________
debed_2 (Debed)                 (None, 20)           2420        decode_2
==================================================================================================
Total params: 147,788
Trainable params: 147,788
Non-trainable params: 0
__________________________________________________________________________________________________
None
Epoch 1/5
1000/1000 [==============================] - 22s 22ms/step - loss: 1.8823 - sparse_categorical_accuracy: 0.4925
Epoch 2/5
1000/1000 [==============================] - 6s 6ms/step - loss: 1.6305 - sparse_categorical_accuracy: 0.5040
Epoch 3/5
1000/1000 [==============================] - 5s 5ms/step - loss: 1.5499 - sparse_categorical_accuracy: 0.5390
Epoch 4/5
1000/1000 [==============================] - 5s 5ms/step - loss: 1.5049 - sparse_categorical_accuracy: 0.5510
Epoch 5/5
1000/1000 [==============================] - 6s 6ms/step - loss: 1.4616 - sparse_categorical_accuracy: 0.5680
``````
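The parameter counts in the summary can be reproduced by hand from the values in `params`. The short sketch below does the arithmetic, assuming that each attention module holds three bias-free `dim_hidden` by `dim_hidden` matrices, that `inflate` has a bias while `deflate` does not, and that the vocabulary has 20 tokens as printed earlier.

```python
dim_vocab, dim_hidden, dim_dense = 20, 6, 150
dim_stacks, len_max_input = 2, 20
width = len_max_input * dim_hidden            # flattened sequence width

embed = dim_vocab * dim_hidden                # embedding table
attention = 3 * dim_hidden * dim_hidden       # q, k and v matrices
conclusion = (width * dim_dense + dim_dense) + dim_dense * width

# An encoder has one attention module, a decoder has two.
encode = dim_stacks * (attention + conclusion)
decode = dim_stacks * (2 * attention + conclusion)
debed = width * dim_vocab + dim_vocab         # kernel plus bias

total = embed + encode + decode + debed
print(embed, encode, decode, debed, total)
# 120 72516 72732 2420 147788
```

Almost all of the weights sit in the two dense kernels of each `Conclusion` module, which follows from flattening the whole padded sequence before the feed-forward step.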

With our TensorBoard `callback` in place, the model’s `fit` method will generate the standard summaries that TB can conveniently visualize.

If you haven’t run the code below, an already generated graph is here.

``````
#%load_ext tensorboard
#%tensorboard --logdir /tmp/q/logs
``````

## Eager execution mode

We can also switch over to the new `eager` execution mode.

This is particularly convenient for experimentation, as all ops are immediately executed. Here is a much shortened `eager` session:

``````
ps.num_epochs = 1
qr.main_eager(ps, qr.dset_for(ps).take(100), model_for(ps))
``````
``````
Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_7 (InputLayer)            [(None,)]            0
__________________________________________________________________________________________________
input_8 (InputLayer)            [(None,)]            0
__________________________________________________________________________________________________
embed_3 (Embed)                 (None, None, 6)      120         input_7
                                                                 input_8
__________________________________________________________________________________________________
encode_3 (Encode)               [(None, None, None), 72516       embed_3
__________________________________________________________________________________________________
decode_3 (Decode)               (None, None, None)   72732       encode_3
                                                                 encode_3
__________________________________________________________________________________________________
debed_3 (Debed)                 (None, 20)           2420        decode_3
==================================================================================================
Total params: 147,788
Trainable params: 147,788
Non-trainable params: 0
__________________________________________________________________________________________________
None
Step: 10 , loss: 2.24212742 , xent: 0.56386137
Step: 20 , loss: 2.33669519 , xent: 0.559803903
Step: 30 , loss: 2.9666779 , xent: 0.556796134
Step: 40 , loss: 3.00279307 , xent: 0.556730747
Step: 50 , loss: 3.11257172 , xent: 0.555238068
Step: 60 , loss: 1.19533181 , xent: 0.553301871
Step: 70 , loss: 2.49692106 , xent: 0.552803755
Step: 80 , loss: 1.94691455 , xent: 0.553240716
Step: 90 , loss: 3.86864901 , xent: 0.552752316
Step: 100 , loss: 2.69638801 , xent: 0.551818192
Epoch 0 loss: 2.696388 , xent: 0.5518182
``````

This concludes our blog. To see how to further customize our model, please continue on to the next blog.