TensorFlow was the new kid on the block when it was introduced in 2015, and last year it became the most used deep learning framework. I jumped on the train a few months after the first release and began my journey into deep learning during my master's thesis. It took a while to get used to the computation graph and session model, but since then I've got my head around most of the quirks and twists.
This short article is no introduction to TensorFlow, but instead offers some quick tips, mostly focused on performance, that reveal common pitfalls and may boost your model and training performance to new levels. We'll start with preprocessing and your input pipeline, visit graph construction and move on to debugging and performance optimizations.
Preprocessing and input pipelines
Keep preprocessing clean and lean
Are you baffled at how long it takes to train your relatively simple model? Check your preprocessing! Any heavy preprocessing, like transforming data into neural network inputs, can significantly slow down your training speed. In my case I was creating so-called 'distance maps', grayscale images used in "Deep Interactive Object Selection" as additional inputs, using a custom Python function. My training speed topped out at around 2.4 images per second, even when I switched to a much more powerful GTX 1080. I then noticed the bottleneck, and after applying my fix I was able to train at around 50 images per second.
If you notice such a bottleneck, the usual first impulse is to optimize the code. But a much more effective way to strip computation time from your training pipeline is to move the preprocessing into a one-time operation that generates TFRecord files. Your heavy preprocessing is then done only once to create TFRecords for all your training data, and your pipeline boils down to loading the records. Even if you want to introduce some kind of randomness to augment your data, it's worth considering creating the different variations once instead of bloating your pipeline.
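As a rough sketch of that one-time step, the snippet below runs an expensive function once per example and stores the results in a TFRecord file. The `distance_map` function is only a hypothetical stand-in (a clipped Euclidean distance to a click point), not the method from the paper, and the feature names, file name and sizes are made up for illustration. It assumes a TensorFlow installation that ships the `tensorflow.compat.v1` module; on older 1.x releases a plain `import tensorflow as tf` would take its place.

```python
import os
import tempfile

import numpy as np
import tensorflow.compat.v1 as tf


def distance_map(height, width, click_yx):
    """Hypothetical stand-in for an expensive preprocessing step."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.sqrt((ys - click_yx[0]) ** 2 + (xs - click_yx[1]) ** 2)
    # Clip to the grayscale range and store as 8-bit values.
    return np.minimum(dist, 255).astype(np.uint8)


def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def write_records(path, clicks, height, width):
    """Run the heavy preprocessing once and store the results on disk."""
    with tf.python_io.TFRecordWriter(path) as writer:
        for click in clicks:
            dmap = distance_map(height, width, click)
            example = tf.train.Example(features=tf.train.Features(feature={
                'height': int64_feature(height),
                'width': int64_feature(width),
                'distance_map': bytes_feature(dmap.tobytes()),
            }))
            writer.write(example.SerializeToString())


# Illustrative values only: two click points on tiny 8x8 images.
record_path = os.path.join(tempfile.mkdtemp(), 'train.tfrecords')
write_records(record_path, clicks=[(2, 3), (5, 1)], height=8, width=8)
```

The training pipeline then only has to read and decode `train.tfrecords`, so the cost of `distance_map` is paid once instead of on every step.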
Watch your queues
One way to notice expensive preprocessing pipelines is to look at the queue graphs in TensorBoard. These are generated automatically if you use the framework's QueueRunners and store the summaries in a file. The graphs show whether your machine was able to keep the queues filled. If you notice negative spikes in the graphs, your system is unable to generate new data in the time your machine needs to process one batch. One of the reasons for this was already discussed in the previous section. The most common reason in my experience is large min_after_dequeue values. If your queues try to keep lots of records in memory, they can easily saturate your memory capacity, which leads to swapping and slows down your queues significantly. Other reasons could be hardware issues like slow disks or simply more data than your system can handle. Whatever it is, fixing it will speed up your training process.
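To get a feeling for how quickly a large min_after_dequeue eats memory, a back-of-the-envelope calculation helps. The numbers below are assumptions chosen for illustration (224x224 RGB images stored as 32-bit floats, a batch size of 32, and a queue capacity of `min_after_dequeue + 3 * batch_size`), and the function is a hypothetical helper, not part of TensorFlow. It gives an upper bound for a completely filled queue.

```python
def queue_memory_bytes(capacity, example_shape, bytes_per_value=4):
    """Upper bound on the memory held by a full queue of dense examples."""
    values_per_example = 1
    for dim in example_shape:
        values_per_example *= dim
    return capacity * values_per_example * bytes_per_value


# Assumed settings, for illustration only.
batch_size = 32
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * batch_size

footprint = queue_memory_bytes(capacity, example_shape=(224, 224, 3))
print('Full queue holds roughly %.2f GB' % (footprint / 1e9))
```

With these assumed settings a single full queue already holds about 6 GB, so lowering min_after_dequeue is often the cheapest fix when the queue graphs show spikes.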
Graph construction and training
Finalize your graph
TensorFlow's separation of graph construction and graph computation is quite rare in day-to-day programming and can cause some confusion for beginners. This applies to bugs and error messages, which can occur for the first time when the graph is built and then again when it's actually evaluated. That is counterintuitive when you are used to code being evaluated just once.
Another issue is graph construction in combination with training loops. These loops are usually 'standard' Python loops and can therefore alter the graph and add new operations to it. Altering a graph while continuously evaluating it causes a major performance loss, but is rather hard to notice at first. Thankfully there is an easy fix. Just finalize your graph before starting your training loop by calling tf.get_default_graph().finalize(). This will lock the graph, and any attempt to add a new operation will throw an error. Exactly what we want.
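Here is a minimal sketch of what a finalized graph does when a loop accidentally tries to extend it. The toy graph and tensor names are made up for illustration, and the snippet assumes a TensorFlow installation that ships the `tensorflow.compat.v1` module; on older 1.x releases a plain `import tensorflow as tf` would take its place.

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None], name='x')
    doubled = x * 2.0

# Lock the graph once construction is complete.
graph.finalize()

caught = None
with tf.Session(graph=graph) as sess:
    for step in range(3):
        # Fine: running existing ops does not modify the graph.
        result = sess.run(doubled, feed_dict={x: [1.0, 2.0]})

    try:
        with graph.as_default():
            # A typical mistake: building a new op inside the training loop.
            leaked = doubled + 1.0
    except RuntimeError as err:
        caught = err
        print('Caught:', err)
```

Running existing operations still works after finalizing, while the attempt to add one raises a RuntimeError that points directly at the offending line.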
Profile your graph
A less prominently advertised feature of TensorFlow is profiling. There is a mechanism to record run times and memory consumption of your graph's operations. This can come in handy if you are looking for bottlenecks or need to find out whether a model can be trained on your machine without swapping to the hard drive.
To generate profiling data you need to perform a single run through your graph with tracing enabled:
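from tensorflow.python.client import timeline

# Collect tracing information during the fifth step.
if global_step == 5:
    # Create an object to hold the tracing data.
    run_metadata = tf.RunMetadata()
    # Run one step and collect the tracing data.
    _, loss = sess.run(
        [train_op, loss_op],
        options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
        run_metadata=run_metadata)
    # Add the run metadata to the summary writer so it shows up in TensorBoard.
    summary_writer.add_run_metadata(run_metadata, 'step%d' % global_step, global_step)
    # Write the trace in Chrome trace format to timeline.json.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())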
Afterwards, a timeline.json file is saved to the current folder and the tracing data becomes available in TensorBoard. You can now easily see how long an operation takes to compute and how much memory it consumes. Just open the graph view in TensorBoard, select your latest run on the left and you should see performance details on the right. On the one hand, this allows you to adjust your model so that it uses your machine as fully as possible; on the other hand, it lets you find bottlenecks in your training pipeline. If you prefer a timeline view, you can load the timeline.json file in Google Chrome's Trace Event Profiling Tool (chrome://tracing).
Another nice tool is tfprof, which makes use of the same functionality for memory and execution time profiling, but offers more convenience features. Additional statistics require code changes.
Watch your memory
Profiling, as explained in the previous section, allows you to keep an eye on the memory usage of particular operations, but watching your whole model's memory consumption is even more important. Always make sure that you don't exceed your machine's memory, as swapping will almost certainly slow down your input pipeline and leave your GPU waiting for new data. A simple top or, as explained in one of the previous sections, the queue graphs in TensorBoard should be sufficient for detecting such behavior. Detailed investigation can then be done using the aforementioned tracing.
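If you would rather have the numbers in your training log than in a second terminal, something along these lines could complement top. It is only a sketch under a few assumptions: it uses Python's standard resource module, so it works on Unix-like systems only, it reports the host memory of the current process rather than GPU memory, and the loop is a placeholder for your real training loop.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in megabytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == 'darwin' else 1024.0
    return peak / divisor


for step in range(1, 301):
    # ... run one training step here ...
    if step % 100 == 0:
        print('step %d: peak RSS %.1f MB' % (step, peak_rss_mb()))
```

A peak that creeps towards the size of your physical memory is a hint that swapping is about to start.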
Print is your friend
My main tool for debugging issues like stagnating loss or strange outputs is tf.Print. Due to the nature of neural networks, looking at the raw values of tensors inside your model usually doesn't make much sense. Nobody can interpret millions of floating point numbers and see what's wrong. But printing out shapes or mean values in particular can give great insights. If you are trying to implement an existing model, this allows you to compare your model's values to the ones in the paper or article, and can help you solve tricky issues or expose typos in papers.
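A small sketch of how this can look in practice: tf.Print passes its first argument through unchanged and prints the listed tensors as a side effect, so the important detail is to keep using the tensor it returns. The placeholder, names and values are made up for illustration, and the snippet assumes a TensorFlow installation that ships the `tensorflow.compat.v1` module; on older 1.x releases a plain `import tensorflow as tf` would take its place.

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

graph = tf.Graph()
with graph.as_default():
    activations = tf.placeholder(tf.float32, shape=[None, 3], name='activations')

    # Print the shape and mean instead of the raw values.
    activations_dbg = tf.Print(
        activations,
        [tf.shape(activations), tf.reduce_mean(activations)],
        message='activations shape / mean: ')

    # Build the rest of the model on the returned tensor; otherwise the
    # print op is never executed.
    output = activations_dbg * 2.0

with tf.Session(graph=graph) as sess:
    result = sess.run(
        output,
        feed_dict={activations: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})
```

The message goes to standard error every time `output` is evaluated; the first_n argument of tf.Print limits how often that happens.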
With TensorFlow 1.0 we have been given the new TFDebugger, which looks very promising. I haven't used it yet, but will definitely try it out in the coming weeks.
Set an operation execution timeout
You have implemented your model, launched your session, and nothing happens? This is usually caused by empty queues, but if you have no idea which queue could be responsible for the mishap, there is an easy fix: just enable the operation execution timeout when creating your session, and your script will crash when an operation exceeds your limit:
config = tf.ConfigProto()
config.operation_timeout_in_ms = 5000
sess = tf.Session(config=config)
Using the stack trace you can then find out which op causes your headache, fix the error and train on.
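To see the timeout in action, the following sketch dequeues from a queue that nobody ever fills. The queue, the op names and the one-second limit are made up for illustration, and the snippet assumes a TensorFlow installation that ships the `tensorflow.compat.v1` module; on older 1.x releases a plain `import tensorflow as tf` would take its place.

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

graph = tf.Graph()
with graph.as_default():
    # A queue without any enqueue op: dequeuing would block forever.
    queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], name='empty_queue')
    dequeue_op = queue.dequeue(name='stuck_dequeue')

config = tf.ConfigProto()
config.operation_timeout_in_ms = 1000  # give up after one second

timed_out = False
with tf.Session(graph=graph, config=config) as sess:
    try:
        sess.run(dequeue_op)
    except tf.errors.DeadlineExceededError as err:
        timed_out = True
        print('Timed out:', err.message)
```

Without the timeout this script would hang silently; with it, the blocked call fails with a DeadlineExceededError whose traceback leads to the sess.run call that got stuck.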
I hope I could help some of my fellow TensorFlow coders. If you found an error, have more tips or just want to get in touch, please send me an email!