Porting code with deep learning

Davi de Castro Reis
9 min read · Aug 29, 2018

If you studied machine learning in college or through some tutorials on the internet, you probably have the feeling that the problems it solves are all similar in some sense. Detecting people in photos, classifying sentiment in a text. It always feels like some task that an unhappy employee is doing in a faraway country.

But when the theory books start to discuss how some machine learning technique can capture any function, this often feels a bit disconnected from reality. Those tasks are not that close to what we computer scientists usually describe as functions. There are some nice articles out there that can help fix that misunderstanding, but in this one I will just show that you can also use machine learning to reproduce a boring programming function, using the same techniques a machine learning expert uses to learn those Mechanical Turk-like tasks. And no, I will not use xor as my example.

The function I will use is Python’s urllib.parse.urlparse. And I will do it without ever looking at its source code. My equivalent learned function can easily be ported to Java, or C++, or any platform with some support for machine learning models. And funnily enough, if the source code of urlparse ever gets lost, we will still have this learned black box with its functionality.

I will also take this a bit further: after learning how to parse urls, we will pick a few arbitrary functions, like hex, and use machine learning to copy them with the same architecture we used for urlparse, changing a single parameter in a function call to achieve this. We will even create a model for the inverse of a function, actually learning a function that no one ever wrote code for.

And to make things even more concrete, I will finally take those functions we translated from Python into a machine learning model and deploy them in the browser, using TensorFlow.js as the platform. It is as if TensorFlow were the new bytecode. I can save my model and suddenly have an RPC service built on top of TensorFlow Serving that knows how to parse urls, or use the model as a transformer in a spark-ml pipeline, or use its intermediate state as part of a search engine. With predictable latency and memory requirements.

It is knowledge captured in matrices. It is Software 2.0.

For all these tasks, I chose Google Colab with its free GPU offering, and I detail each step in the sections below. First, learn how to install Apache Spark with anonymous s3 support to reach the training data. Then see how to prepare your data both for the development cycle and for learning production models. Next, meet the powerful seq2seq+attention model we use to learn arbitrary functions, and apply it to copy existing functionality from Python libraries. Finally, deploy your models in the browser.

Setting up Apache Spark

As our first step, we will install Apache Spark. That requires a working JDK plus a manual Hadoop installation, since we cannot use the bundled Hadoop versions due to their lack of proper anonymous s3 support. We use the shell access support to complete these installations and glue them together through the SPARK_DIST_CLASSPATH env var.
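A minimal Colab cell along these lines does the job; the exact Spark and Hadoop versions here are my assumptions, so pick whatever matching releases are current:

```python
# Install a JDK, a hadoop-free Spark, and a standalone Hadoop 2.8.x.
!apt-get install -y -qq openjdk-8-jdk-headless
!wget -q https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-without-hadoop.tgz
!tar xf spark-2.3.1-bin-without-hadoop.tgz
!wget -q https://archive.apache.org/dist/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz
!tar xf hadoop-2.8.4.tar.gz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-without-hadoop"
# Glue the standalone hadoop onto spark's classpath.
os.environ["SPARK_DIST_CLASSPATH"] = os.popen(
    "/content/hadoop-2.8.4/bin/hadoop classpath").read().strip()
```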

Now we need to access our installation from Python, and there is already a tutorial on how to set up Apache Spark on Google Colaboratory. But we need extra steps to access AWS s3 files, which is why we needed a Hadoop 2.8.x version in the first place; otherwise we cannot read public s3 files. Below is a modified version of the tutorial that takes care of this. As a bonus, we enable the new Apache Arrow support to get faster conversions to pandas.
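A sketch of that session setup, assuming the environment variables from the previous cell:

```python
!pip install -q findspark

import findspark
findspark.init()  # picks up SPARK_HOME set above

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Pull in the s3a connector matching our hadoop version.
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.8.4")
         # Anonymous access: no AWS credentials needed for public buckets.
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
         # The bonus: Arrow makes toPandas() dramatically faster.
         .config("spark.sql.execution.arrow.enabled", "true")
         .getOrCreate())
```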

Finally, we can read anonymous data from s3. There are many public datasets one can access on AWS, and they are now starting to offer data in parquet format, the most convenient columnar storage out there. The code snippet below reads some urls and other metadata from the multi-petabyte CommonCrawl repository, ready to be used in your machine learning investigations.
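Something like this; the bucket path is the public cc-index table as it was published around 2018, and the column names are my best guess at its schema:

```python
# Lazily open the CommonCrawl url index; nothing is downloaded yet.
warc_index = spark.read.parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/")
warc_index.select("url", "content_mime_type", "fetch_status").show(5, truncate=False)
```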

Data wrangling

The usual machine learning toy example assumes you can load all the data in memory, usually into a nice pandas dataframe, from which one can derive all sorts of useful charts and statistics. However, Python and pandas are quite memory hungry in general, and even loading a million urls is more than enough to blow past the limits. In our case, we would like to potentially access all the training data we have available, a meager 144M urls. Let us use spark to get a handle on all the data, but for now load only a small sample of it into pandas.
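In code, that split looks roughly like this, reusing the spark handle from the previous section:

```python
# Keep the full dataset behind a lazy spark handle; only a sample goes to pandas.
urls = warc_index.select("url")
dev_df = urls.limit(100000).toPandas()  # small enough for Colab's memory
```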

In this situation, people usually end up with two separate pieces of code, one more debugging oriented and another more production oriented, and sometimes they diverge, creating quite a pain for the machine learning practitioner. According to a friend, the term for this problem is “training/serving skew”, and there is no magical solution. After all, the goals of extracting the last bits of performance and memory are at odds with keeping extra data around and doing extra computations for debugging.

Here we carefully organized our code to strike a good compromise. First, let us work at the boundaries of the model, doing what is usually called vectorization. On the input side, we need code that takes strings and creates vectors; on the output side, we need to convert the one-hot encoded output representation back into a string. We can test this part of the code by using to_categorical as if it were a learned identity function, and check that we can reproduce the input after it goes through.
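A minimal version of that round-trip test; the byte shift that reserves 0 for padding is my convention, not necessarily the article's:

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

VOCAB = 257  # 256 byte values, shifted by one so 0 stays reserved for padding

def encode(s):
    return np.array([b + 1 for b in s.encode("utf-8")])

def decode(one_hot):
    # Invert the one-hot output representation back into a string.
    return bytes(int(i) - 1 for i in one_hot.argmax(axis=-1)).decode("utf-8")

url = "https://example.com/path?q=1"
assert decode(to_categorical(encode(url), num_classes=VOCAB)) == url
```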

Now we write carefully thought out numpy code to process each batch, and on top of it we overlay pandas transformations for visualization and analysis. The key insight is that doing transformations at the batch level suffices to leverage numpy’s vectorization gains, while still allowing reasonable memory consumption when moving small collections of batches into the pandas world. In particular, using locals() is a great trick to automatically derive your pandas column names. The code below efficiently generates a dictionary with the features ready to be used in training.
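A sketch of that batch vectorizer, simplified to the encoder side only:

```python
import numpy as np

def vectorize_batch(urls, max_len=128):
    # Pad only to the longest url in this batch, not the whole collection.
    batch_len = min(max(len(u) for u in urls), max_len)
    encoder_input = np.zeros((len(urls), batch_len), dtype=np.int32)
    for i, u in enumerate(urls):
        data = [b + 1 for b in u.encode("utf-8")][:batch_len]
        encoder_input[i, :len(data)] = data
    # The locals() trick: every numpy array defined above becomes a named
    # feature, and later a pandas column, without repeating its name.
    return {k: v for k, v in locals().items() if isinstance(v, np.ndarray)}
```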

And the next block of code shows how we can chain a simple function to capture the same information in a memory hungry but nice to look at pandas dataframe.
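For example, something along these lines:

```python
import pandas as pd

def batch_to_frame(batch):
    # Memory hungry but pleasant to inspect: one column per feature,
    # with the column names coming straight from the locals() dict.
    return pd.DataFrame({k: list(v) for k, v in batch.items()})

batch_to_frame(vectorize_batch(["http://example.com/a", "https://foo.bar/b"]))
```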

We do one extra step in the data preparation and wrap our data in a generator. The reason is that a full pass over our data would take too long, and we would be looking at sad empty graphs in Tensorboard, because its curves only get updated at the end of each epoch. Furthermore, when you have effectively infinite training data, the notion of an epoch is ill defined anyway, so arbitrarily declaring what counts as one makes sense.
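A sketch of such a generator, assuming vectorize_batch was extended to also emit decoder_input and target arrays alongside encoder_input:

```python
def batch_generator(urls, batch_size=512):
    # Cycle forever: with effectively infinite data, an "epoch" is simply
    # whatever steps_per_epoch we later hand to fit_generator.
    i = 0
    while True:
        chunk = urls[i:i + batch_size]
        i = (i + batch_size) % max(len(urls) - batch_size, 1)
        b = vectorize_batch(chunk)
        yield [b["encoder_input"], b["decoder_input"]], b["target"]
```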

Learning

In this step, we assemble and train the neural network that mimics urlparse. The architecture is an almost vanilla sequence-to-sequence network. We have a pair of embedding layers to represent each byte of the input and output, a pair of LSTM layers to encode the network state as it traverses each url and the so-far-parsed output, and an attention mechanism to make sure we can handle long sequences.
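Here is a sketch of that architecture in Keras; the layer sizes and the dot-product style of attention are my assumptions, not the article's exact code:

```python
from tensorflow.keras import layers, models

EMBED, UNITS = 64, 256  # assumed sizes

enc_in = layers.Input(shape=(None,), name="encoder_input")
dec_in = layers.Input(shape=(None,), name="decoder_input")

enc_emb = layers.Embedding(VOCAB, EMBED)(enc_in)
dec_emb = layers.Embedding(VOCAB, EMBED)(dec_in)

# The encoder keeps its full sequence so attention can look back at every byte.
enc_seq, h, c = layers.LSTM(UNITS, return_sequences=True, return_state=True)(enc_emb)
dec_seq = layers.LSTM(UNITS, return_sequences=True)(dec_emb, initial_state=[h, c])

# Dot-product attention over the encoder outputs.
scores = layers.Dot(axes=[2, 2])([dec_seq, enc_seq])   # (batch, T_dec, T_enc)
weights = layers.Activation("softmax")(scores)
context = layers.Dot(axes=[2, 1])([weights, enc_seq])  # (batch, T_dec, UNITS)
combined = layers.Concatenate()([context, dec_seq])

probs = layers.Dense(VOCAB, activation="softmax")(combined)
model = models.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```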

One tricky part when using tensorflow under keras is the usage of global graphs and sessions, which is already confusing in tensorflow alone, and even more so with keras, because the abstractions are more hidden and harder to manipulate. In particular, carefully calling K.clear_session() is key to avoid leaking state as you execute your training cell multiple times.
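In practice that means starting the training cell like this (build_model is a hypothetical factory wrapping the architecture above):

```python
from tensorflow.keras import backend as K

# Re-running the cell would otherwise pile new layers onto the same
# global graph; clearing the session first avoids leaking that state.
K.clear_session()
model = build_model()
```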

The other gotcha that cannot be ignored is that the Keras GRU and LSTM layers do not really use the GPU. You need the CuDNN versions of these layers, released with Tensorflow 1.9, to see a meaningful improvement from the hardware. With that in mind, let us start training a simple model to test our code. In this first step, we will tackle a simpler problem, predicting the protocol given the first 32 characters of a url, and we will use the small dev data from a pandas dataframe, with no validation data and only the default instrumentation from keras. At this point, the important thing is to see that our code does not blow up and that the model starts to overfit, ideally within a few minutes. We will add a few more bells and whistles after that, and spend more time training on more complex problems.
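The swap itself is a one-liner per layer, for example:

```python
from tensorflow.keras.layers import CuDNNLSTM

# Drop-in replacement for the LSTM above; runs on the GPU via cuDNN,
# at the cost of not supporting masking.
enc_seq, h, c = CuDNNLSTM(UNITS, return_sequences=True,
                          return_state=True)(enc_emb)
```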

The seq2seq architecture is able to infer the next character given an input and the output generated so far. In training, the full output is available, but during inference we need to produce it character by character. This is called decoding, and tensorflow ships several helper functions for it. Here, we implement it in very inefficient but clear numpy code. And we can finally see the results of our trained development model.
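A greedy decoding loop in that spirit, reusing the encode helper from earlier; treating token 0 as both the seed and the end marker is my simplification:

```python
import numpy as np

def greedy_decode(model, url, max_len=128):
    enc = encode(url)[np.newaxis, :]            # (1, T_enc)
    dec = np.zeros((1, 1), dtype=np.int32)      # seed with a padding/start token
    out = []
    for _ in range(max_len):
        probs = model.predict([enc, dec])       # (1, T_dec, VOCAB)
        nxt = int(probs[0, -1].argmax())        # most likely next byte
        if nxt == 0:                            # padding marks end of sequence
            break
        out.append(nxt)
        dec = np.concatenate([dec, [[nxt]]], axis=1)
    return bytes(b - 1 for b in out).decode("utf-8", errors="replace")
```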

Learning for real

Our toy model trained quickly and achieved perfect accuracy. However, it simply has not seen enough data to generalize, and it is working on a very simple problem. Really learning to parse urls will require more data and more time. For that, let us add some extra instrumentation: tensorboard to follow learning, and a custom callback so we actually see the parsing attempts at every epoch as the model converges. These callbacks are particularly useful when some problem is preventing your model from learning.
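Something like the following; the decode-and-print callback is a hypothetical reconstruction:

```python
from tensorflow.keras.callbacks import LambdaCallback, TensorBoard

tensorboard = TensorBoard(log_dir="./logs")

# Decode a few dev urls at every epoch end, so we can literally watch the
# parses improve (or diagnose why they do not).
show_parses = LambdaCallback(on_epoch_end=lambda epoch, logs: [
    print(u, "->", greedy_decode(model, u)) for u in dev_df["url"][:3]])
```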

Because this time our goal is not simply to overfit the model, we need validation data. In the code below we just grab the first batch of the training data and use it. Extra batches, or a generator itself, would be better, but Keras does not play nice with these more sophisticated options.
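```python
gen = batch_generator(dev_df["url"].tolist())
validation_data = next(gen)  # reuse the first batch as a fixed validation set
```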

Also, notice that the performance improvement of the CuDNN layers is over an order of magnitude, but it comes at the price of not supporting the mask_zero flag. That means that in batches containing one very long sequence, the learning process and the loss function will be mostly concerned with the padded zeros in all the other inputs of the batch. Luckily, since we define the padding length per batch, not over the whole collection, the problem is mostly contained. Still, if you really care about it, you can use the loss function below, which caps the padding’s contribution at half of the total loss. You still lose the computing cycles spent learning how to generate those zeros, but the CuDNN performance gains far outweigh that slowdown.
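One way to implement that cap, assuming class 0 in the one-hot target is the padding token:

```python
from tensorflow.keras import backend as K

def padding_capped_loss(y_true, y_pred):
    loss = K.categorical_crossentropy(y_true, y_pred)  # (batch, T)
    is_pad = y_true[..., 0]                            # 1.0 at padded positions
    real = K.sum(loss * (1.0 - is_pad))
    pad = K.sum(loss * is_pad)
    # Padding contributes at most as much as the real characters do,
    # i.e. at most half of the total loss.
    return real + K.minimum(pad, real)
```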

With these tools in place, we can now train our copy of the url parser function. Because we organized our code with both the development and production cycles in mind, we can train this complex model using almost the same code as the one we used for the simple protocol extraction model.
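The training call then looks roughly like this; the step counts are illustrative, chosen to fit the just-under-an-hour budget mentioned below:

```python
model.fit_generator(gen,
                    steps_per_epoch=200,
                    epochs=50,
                    validation_data=validation_data,
                    callbacks=[tensorboard, show_parses])
```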

And with a bit less than an hour of training, we now have a url parser with perfect accuracy. Let us see the results when we parse some of the urls in our dev data.

Reusing the architecture

The architecture we developed knows how to receive bytes in the input and how to generate bytes in the output. There is nothing in it specific to parsing urls, and in fact we can use it to copy virtually any python function that takes a string and whose output can be represented as json. Let us try it.
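A hypothetical helper makes the recipe concrete: any string-to-json function becomes a table of training pairs for the same architecture.

```python
import json

def make_pairs(fn, inputs):
    # (input string, json-serialized output) pairs, ready for vectorization.
    return [(s, json.dumps(fn(s))) for s in inputs]

# The same architecture can then learn hex instead of urlparse.
pairs = make_pairs(lambda s: hex(int(s)), [str(i) for i in range(100000)])
```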

More impressively, we can not only copy python code that already exists, but also use our architecture to learn the inverse of these functions, essentially learning functions that no one ever wrote code for. Below you can see the code we used to define a my_date_format function, which displays human friendly dates for humans who like hexadecimal, how we created a table to define its inverse, and how we learned a model that can parse dates in that format. Letting the learning finish would have taken a few hours, but you can see that in a short amount of time we got my birth datetime parsed almost correctly, off by just one day.
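A reconstruction of the idea; the exact format string is my guess:

```python
import datetime
import random

def my_date_format(d):
    # Human friendly dates, for humans who like hexadecimal.
    return "{}-{}-{}".format(hex(d.year), hex(d.month), hex(d.day))

def random_date():
    return datetime.date(random.randint(1900, 2020),
                         random.randint(1, 12), random.randint(1, 28))

# A table defining the inverse: formatted string -> iso date. No one ever
# wrote a parser for this format; the model learns one from the table.
table = [(my_date_format(d), d.isoformat())
         for d in (random_date() for _ in range(100000))]
```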

In the next installment, we will see how to move the predict function out of python code and into our tensorflow graph. With that, our general sequence learning architecture can be deployed in any environment where tensorflow runs. And we will actually show how to do it with TensorFlow.js.
