In this blog, we continue to elaborate on our loosely transformer-based approach to training our computer to “understand” elementary symbolic arithmetic. As mentioned previously, a quick and fun intro to the venerable transformer is here.
Our objective is to expand our layers’ repertoire to include the encoder and decoder stacks, and ultimately to arrive at a model representable by the graph and the runnable example.
We need to prep our environment to run any meaningful code:
import numpy as np
import tensorflow as tf
import dataset as qd
import ragged as qr
ks = tf.keras
kl = ks.layers
Before we start to focus on our stacks, an important feature of encoding textual inputs needs to be considered. To aid in making sense of a text, we need to include not just the information carried by the tokens themselves but also their position in the input sequence.
Following the “positional encoding” approach from here, we can define our pos_timing function as follows.
A quick graphical plot helps us confirm the correctness of our code, as the concatenated sin and cos timing signals give us finely graded, rich positional embeddings:
def pos_timing(width, depth):
    assert depth % 2 == 0
    d = np.arange(depth)[np.newaxis, :]
    d = 1 / np.power(10000, (2 * (d // 2)) / np.float32(depth))
    t = np.arange(width)[:, np.newaxis] * d
    t = [np.sin(t[:, 0::2]), np.cos(t[:, 1::2])]
    t = np.concatenate(t, axis=-1)[np.newaxis, ...]
    return t
pos = pos_timing(50, 512)
import matplotlib.pyplot as plt
plt.pcolormesh(pos[0], cmap='RdBu')
plt.xlabel('Depth')
plt.xlim((0, 512))
plt.ylabel('Position')
plt.colorbar()
plt.show()
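As a quick sanity check (a minimal sketch of what the plot confirms visually), the generated signal has one row per position and its values stay within the sine/cosine range:
print(pos.shape)  # (1, 50, 512): a batch of 1, 50 positions, a 512-wide signal
print(pos.min() >= -1.0, pos.max() <= 1.0)  # True True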

Loading our already created metadata from the source files gives us:
print(qd.vocab)
(' ', ':', '|', 'x', 'y', '=', ',', '+', '-', '*', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9')
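Purely for illustration (this is not the actual dataset pipeline), assuming a character-level tokenization against this vocab, a snippet of an expression would map to indices like so:
print([qd.vocab.index(c) for c in 'x=5'])  # [3, 5, 15]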
We continue by defining our own shared Keras “base” layer.
At this time, we only store a reference to our parameters instance in it, as all our layers will need to use this resource:
class Layer(kl.Layer):
    def __init__(self, ps, **kw):
        kw.setdefault('dtype', tf.float32)
        super().__init__(**kw)
        self.ps = ps
Embedding layer
The Embed layer is taken directly from our previous blog.
As we need to add the above-mentioned positional “timing signals” to our embeddings, we first create just such a constant tensor.
Then, using the previously mentioned RaggedTensor technique of extracting the ragged-shaped values from a dense tensor, we simply add the positional info to the already computed embedding:
class Embed(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        s = (ps.dim_vocab, ps.dim_hidden)
        self.emb = self.add_weight(name='emb', shape=s)
        p = pos_timing(ps.len_max_input, ps.dim_hidden)
        p = tf.constant(p, dtype=tf.float32)
        self.pos = tf.broadcast_to(p, [ps.dim_batch] + p.shape[1:])

    def call(self, x):
        fv, rs = x
        x = tf.RaggedTensor.from_row_splits(fv, rs)
        y = tf.ragged.map_flat_values(tf.nn.embedding_lookup, self.emb, x)
        y += tf.RaggedTensor.from_tensor(self.pos, lengths=y.row_lengths())
        return y
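To make the flat-values/row-splits convention concrete, here is a minimal, made-up sketch of what the Embed layer’s call receives and reconstructs (the token ids are arbitrary):
fv = tf.constant([3, 5, 15, 4, 5, 11], dtype=tf.int32)  # all rows concatenated into one flat vector
rs = tf.constant([0, 3, 6], dtype=tf.int64)             # row boundaries: rows are fv[0:3] and fv[3:6]
print(tf.RaggedTensor.from_row_splits(fv, rs))          # <tf.RaggedTensor [[3, 5, 15], [4, 5, 11]]>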
Encode and decode stacks
The next layers to write are the Encode and Decode stacks.
We implement them as simple lists of the respective Encoder and Decoder components. When calling the stacks, the code simply loops through the component lists and calls the components.
In order to “chain” the stacks, every component is given the output of the previous component as its input.
In the case of the Decoder components, in addition to their regular inputs, we also supply the previously encoded output as their ctx argument:
class Encode(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        self.encs = [Encoder(self, f'enc_{i}') for i in range(ps.dim_stacks)]

    def call(self, x):
        y = x
        for e in self.encs:
            y, ctx = e(y)
        return [y, ctx]
class Decode(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        self.decs = [Decoder(self, f'dec_{i}') for i in range(ps.dim_stacks)]

    def call(self, x):
        y, ctx = x
        for d in self.decs:
            y = d([y, ctx])
        return y
“Debed” layer
The mirror “image” of the Embed layer is our Debed layer.
While the embedding step maps int tokens to higher-dimensional, learned float values, the “debedding” step does the opposite: it takes the higher-dimensional values and maps them to vocabulary-sized vectors of logits, approximating one-hot encodings of the output tokens.
As debedding is implemented using a Dense component, it also requires a fixed width. Just as in the previous blog, we simply pad our ragged tensors with 0s to our len_max_input parameter, as our calculations are complete and the raggedness is not needed any longer:
class Debed(Layer):
    def __init__(self, ps):
        super().__init__(ps)
        self.max_len = u = ps.len_max_input
        s = [u * ps.dim_hidden, ps.dim_vocab]
        self.dbd = Dense(self, 'dbd', s)

    def call(self, x):
        y = x.to_tensor()
        s = tf.shape(y)
        y = tf.pad(y, [[0, 0], [0, self.max_len - s[-2]], [0, 0]])
        y = tf.reshape(y, [-1, self.max_len * s[-1]])
        y = self.dbd(y)
        return y
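A minimal, self-contained sketch of Debed’s pad-and-flatten step may help; the values are made up, while max_len=20 and dim_hidden=6 match the parameters we set further below:
x = tf.ragged.constant([[[0.1] * 6] * 3, [[0.2] * 6] * 5], ragged_rank=1)
y = x.to_tensor()                                           # (2, 5, 6): padded to the longest row in the batch
y = tf.pad(y, [[0, 0], [0, 20 - tf.shape(y)[-2]], [0, 0]])  # (2, 20, 6): padded to len_max_input
y = tf.reshape(y, [-1, 20 * 6])                             # (2, 120): flattened, ready for the Dense debedding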
Lightweight Modules
We have now completed the definition of our top layers as Keras layers, but we still need to define the inner components.
We could continue using the seemingly “heavy” Keras layers and nest them deeper. Instead, as presented in a previous blog, we switch over to using the much lighter-weight Module as the base class for our inner components.
Our Encoder thus becomes a simple module containing the self-attention followed by the feed-forward mechanisms. We fittingly call the inner modules reflect and conclude.
Our Decoder also adds the attention layer looking at the previously encoded “context”. Hence, the module encapsulating this attention component is called consider:
class Encoder(tf.Module):
    def __init__(self, layer, name=None):
        super().__init__(name=name)
        with self.name_scope:
            self.reflect = Attention(layer, 'refl')
            self.conclude = Conclusion(layer, 'conc')

    @tf.Module.with_name_scope
    def __call__(self, x):
        y, ctx = self.reflect([x, None])
        y = self.conclude(y)
        return [y, ctx]
class Decoder(tf.Module):
    def __init__(self, layer, name=None):
        super().__init__(name=name)
        with self.name_scope:
            self.reflect = Attention(layer, 'refl')
            self.consider = Attention(layer, 'cnsd')
            self.conclude = Conclusion(layer, 'conc')

    @tf.Module.with_name_scope
    def __call__(self, x):
        x, ctx = x
        y, _ = self.reflect([x, None])
        y, _ = self.consider([y, ctx])
        y = self.conclude(y)
        return y
Attention Module
Our Attention component, again based on Module, is taken directly from the previous blog. As explained there, it relies on and takes advantage of the new RaggedTensors:
class Attention(tf.Module):
    def __init__(self, layer, name):
        super().__init__(name=name)
        h = layer.ps.dim_hidden
        self.scale = 1 / (h**0.5)
        with self.name_scope:
            self.q = layer.add_weight('q', shape=(h, h))
            self.k = layer.add_weight('k', shape=(h, h))
            self.v = layer.add_weight('v', shape=(h, h))

    @tf.Module.with_name_scope
    def __call__(self, x):
        x, ctx = x  # ctx is not used yet: queries, keys and values all come from x
        q = x.with_values(tf.einsum('ni,ij->nj', x.flat_values, self.q))
        k = x.with_values(tf.einsum('ni,ij->nj', x.flat_values, self.k))
        v = x.with_values(tf.einsum('ni,ij->nj', x.flat_values, self.v))
        y = tf.einsum('bsi,bzi->bsz', q.to_tensor(), k.to_tensor())
        y = tf.nn.softmax(y * self.scale)
        y = tf.einsum('bsz,bzi->bsi', y, v.to_tensor())
        y = tf.RaggedTensor.from_tensor(y, lengths=x.row_lengths())
        return [y, tf.constant(1)]  # the constant is a placeholder "context" output
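The dense einsum dance at the core of the attention above can be traced with made-up shapes; here is a minimal sketch with a batch of 2, rows of 4 tokens, and our hidden width of 6:
q = tf.random.normal([2, 4, 6])
k = tf.random.normal([2, 4, 6])
v = tf.random.normal([2, 4, 6])
y = tf.einsum('bsi,bzi->bsz', q, k)  # (2, 4, 4): every query dotted with every key
y = tf.nn.softmax(y / 6 ** 0.5)      # normalized attention weights over the key axis
y = tf.einsum('bsz,bzi->bsi', y, v)  # (2, 4, 6): attention-weighted mix of the values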
A new component is our Conclusion module. It implements the “feed-forward” functionality of the transformer.
In simple terms, it takes the attention-enhanced, higher-dimensional, element-wise mapping of the token sequence and first inflates it to an even higher dimension, applying a non-linearity, or activation, as its “concluding” step.
Then it deflates the activated mapping back to our hidden dimension, making it available for the next level in the stack.
The same RaggedTensor trick, as the one we used in the Attention module, applies at the end:
class Conclusion(tf.Module):
    def __init__(self, layer, name):
        super().__init__(name=name)
        ps = layer.ps
        self.max_len = w = ps.len_max_input
        w *= ps.dim_hidden
        with self.name_scope:
            s = [w, ps.dim_dense]
            self.inflate = Dense(layer, 'infl', s, activation='relu')
            s = [ps.dim_dense, w]
            self.deflate = Dense(layer, 'defl', s, bias=False)

    @tf.Module.with_name_scope
    def __call__(self, x):
        y = x.to_tensor()
        s = tf.shape(y)
        y = tf.pad(y, [[0, 0], [0, self.max_len - s[-2]], [0, 0]])
        y = tf.reshape(y, [-1, self.max_len * s[-1]])
        y = self.inflate(y)
        y = self.deflate(y)
        y = tf.reshape(y, [-1, self.max_len, s[-1]])
        y = tf.RaggedTensor.from_tensor(y, lengths=x.row_lengths())
        return y
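The tail end of the Conclusion does the inverse of the padding trick; a tiny sketch, continuing with the made-up row lengths [3, 5] from the Debed example above:
y = tf.reshape(tf.zeros([2, 120]), [-1, 20, 6])     # back to the padded (2, 20, 6) layout
y = tf.RaggedTensor.from_tensor(y, lengths=[3, 5])  # padding dropped, rows trimmed to their true lengths
print(y.shape)                                      # (2, None, 6)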
Our last component is the Dense module.
It simply re-implements the Keras layer with the same name, yet with more focused, streamlined functionality and minimal configurability.
The interesting aspect of this module, just as with our Attention module above, is that the necessary Keras weights are created through the enclosing Keras layer; topologically, however, they are listed directly as part of their respective modules:
class Dense(tf.Module):
    bias = None
    activation = None

    def __init__(self, layer, name, shape, activation=None, bias=True):
        super().__init__(name=name)
        with self.name_scope:
            kw = dict(shape=shape, initializer='glorot_uniform')
            self.kern = layer.add_weight('kern', **kw)
            if bias:
                kw.update(shape=shape[1:], initializer='zeros')
                self.bias = layer.add_weight('bias', **kw)
            self.activation = ks.activations.get(activation)

    @tf.Module.with_name_scope
    def __call__(self, x):
        y = tf.einsum('bi,ij->bj', x, self.kern)
        if self.bias is not None:
            y = tf.nn.bias_add(y, self.bias)
        if self.activation:
            y = self.activation(y)
        return y
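A quick, hypothetical usage of this Dense module (the shapes here are illustrative only): the enclosing Keras Layer instance merely hosts the weights, so the base Layer from above, without any parameters, is enough:
host = Layer(None)                                 # ps is not needed by Dense itself
dense = Dense(host, 'demo', [4, 3], activation='relu')
print(dense(tf.ones([2, 4])).shape)                # (2, 3)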
Training session
And now we are ready to define our model.
We have two inputs, the two components (flat values and row splits) of our input RaggedTensor.
We also use our new Embed, Encode, Decode and Debed Keras layers, with all the internal, light-weight modules hidden at this level.
The rest of the model is simply carried over from the previous blog:
def model_for(ps):
    x = [ks.Input(shape=(), dtype='int32'), ks.Input(shape=(), dtype='int64')]
    y = Embed(ps)(x)
    y = Encode(ps)(y)
    y = Decode(ps)(y)
    y = Debed(ps)(y)
    m = ks.Model(inputs=x, outputs=y)
    m.compile(optimizer=ps.optimizer, loss=ps.loss, metrics=[ps.metric])
    print(m.summary())
    return m
Our parameters need to be adjusted to provide values for our new stacks:
params = dict(
    dim_batch=2,
    dim_dense=150,
    dim_hidden=6,
    dim_stacks=2,
    dim_vocab=len(qd.vocab),
    len_max_input=20,
    loss=ks.losses.SparseCategoricalCrossentropy(from_logits=True),
    metric=ks.metrics.SparseCategoricalAccuracy(),
    num_epochs=5,
    num_shards=2,
    optimizer=ks.optimizers.Adam(),
)
By firing up our training session, we can confirm the model’s layers and connections. The listing of a short session follows.
We can easily adjust the parameters to tailor the length of the sessions to our objectives. However, at this point the results are still largely meaningless and extending the training sessions is not yet warranted.
ps = qd.Params(**params)
import masking as qm
qm.main_graph(ps, qr.dset_for(ps), model_for(ps))
Model: "model_2"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_5 (InputLayer) [(None,)] 0
__________________________________________________________________________________________________
input_6 (InputLayer) [(None,)] 0
__________________________________________________________________________________________________
embed_2 (Embed) (None, None, 6) 120 input_5[0][0]
input_6[0][0]
__________________________________________________________________________________________________
encode_2 (Encode) [(None, None, None), 72516 embed_2[0][0]
__________________________________________________________________________________________________
decode_2 (Decode) (None, None, None) 72732 encode_2[0][0]
encode_2[0][1]
__________________________________________________________________________________________________
debed_2 (Debed) (None, 20) 2420 decode_2[0][0]
==================================================================================================
Total params: 147,788
Trainable params: 147,788
Non-trainable params: 0
__________________________________________________________________________________________________
None
Epoch 1/5
1000/1000 [==============================] - 22s 22ms/step - loss: 1.8823 - sparse_categorical_accuracy: 0.4925
Epoch 2/5
1000/1000 [==============================] - 6s 6ms/step - loss: 1.6305 - sparse_categorical_accuracy: 0.5040
Epoch 3/5
1000/1000 [==============================] - 5s 5ms/step - loss: 1.5499 - sparse_categorical_accuracy: 0.5390
Epoch 4/5
1000/1000 [==============================] - 5s 5ms/step - loss: 1.5049 - sparse_categorical_accuracy: 0.5510
Epoch 5/5
1000/1000 [==============================] - 6s 6ms/step - loss: 1.4616 - sparse_categorical_accuracy: 0.5680
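As a back-of-the-envelope check, the parameter counts in the summary can be reproduced by hand from the shapes of our weights and the values we just set:
hidden, vocab, max_len, dense, stacks = 6, 20, 20, 150, 2
embed = vocab * hidden                                                       # 120
attention = 3 * hidden * hidden                                              # q, k and v kernels: 108
conclusion = (max_len * hidden * dense + dense) + dense * max_len * hidden   # inflate (with bias) + deflate: 36150
encode = stacks * (attention + conclusion)                                   # 72516
decode = stacks * (2 * attention + conclusion)                               # 72732
debed = max_len * hidden * vocab + vocab                                     # 2420
print(embed + encode + decode + debed)                                       # 147788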
With our TensorBoard callback in place, the model’s fit method will generate the standard summaries that TB can conveniently visualize.
If you haven’t run the code below, an already generated graph is here.
#%load_ext tensorboard
#%tensorboard --logdir /tmp/q/logs
Eager execution mode
We can also switch over to the new eager execution mode.
This is particularly convenient for experimentation, as all ops are immediately executed. Here is a much shortened eager session:
ps.num_epochs = 1
qr.main_eager(ps, qr.dset_for(ps).take(100), model_for(ps))
Model: "model_3"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_7 (InputLayer) [(None,)] 0
__________________________________________________________________________________________________
input_8 (InputLayer) [(None,)] 0
__________________________________________________________________________________________________
embed_3 (Embed) (None, None, 6) 120 input_7[0][0]
input_8[0][0]
__________________________________________________________________________________________________
encode_3 (Encode) [(None, None, None), 72516 embed_3[0][0]
__________________________________________________________________________________________________
decode_3 (Decode) (None, None, None) 72732 encode_3[0][0]
encode_3[0][1]
__________________________________________________________________________________________________
debed_3 (Debed) (None, 20) 2420 decode_3[0][0]
==================================================================================================
Total params: 147,788
Trainable params: 147,788
Non-trainable params: 0
__________________________________________________________________________________________________
None
Step: 10 , loss: 2.24212742 , xent: 0.56386137
Step: 20 , loss: 2.33669519 , xent: 0.559803903
Step: 30 , loss: 2.9666779 , xent: 0.556796134
Step: 40 , loss: 3.00279307 , xent: 0.556730747
Step: 50 , loss: 3.11257172 , xent: 0.555238068
Step: 60 , loss: 1.19533181 , xent: 0.553301871
Step: 70 , loss: 2.49692106 , xent: 0.552803755
Step: 80 , loss: 1.94691455 , xent: 0.553240716
Step: 90 , loss: 3.86864901 , xent: 0.552752316
Step: 100 , loss: 2.69638801 , xent: 0.551818192
Epoch 0 loss: 2.696388 , xent: 0.5518182
This concludes our blog; to see how to further customize our model, please click through to the next blog.