
.. _advanced_theano:

***************
Advanced Theano
***************

Conditions
----------
**IfElse**

- Build condition over symbolic variables.
- IfElse Op takes a boolean condition and two variables to compute as input.
- While Switch Op evaluates both 'output' variables, IfElse Op is lazy and only
  evaluates one variable respect to the condition.

**IfElse Example: Comparison with Switch**

.. testcode::

   from theano import tensor as T
   from theano.ifelse import ifelse
   import theano, time, numpy

   a,b = T.scalars('a','b')
   x,y = T.matrices('x','y')

   z_switch = T.switch(T.lt(a,b), T.mean(x), T.mean(y))
   z_lazy = ifelse(T.lt(a,b), T.mean(x), T.mean(y))

   f_switch = theano.function([a,b,x,y], z_switch,
                              mode=theano.Mode(linker='vm'))
   f_lazyifelse = theano.function([a,b,x,y], z_lazy,
                                  mode=theano.Mode(linker='vm'))

   val1 = 0.
   val2 = 1.
   big_mat1 = numpy.ones((10000,1000))
   big_mat2 = numpy.ones((10000,1000))

   n_times = 10

   tic = time.clock()
   for i in range(n_times):
       f_switch(val1, val2, big_mat1, big_mat2)
   print('time spent evaluating both values %f sec' % (time.clock()-tic))

   tic = time.clock()
   for i in range(n_times):
       f_lazyifelse(val1, val2, big_mat1, big_mat2)
   print('time spent evaluating one value %f sec' % (time.clock()-tic))

.. testoutput::
   :hide:
   :options: +ELLIPSIS

   time spent evaluating both values ... sec
   time spent evaluating one value ... sec

IfElse Op spend less time (about an half) than Switch since it computes only
one variable instead of both.

.. code-block:: none

  $ python ifelse_switch.py
  time spent evaluating both values 0.6700 sec
  time spent evaluating one value 0.3500 sec

Note that IfElse condition is a boolean while Switch condition is a tensor, so
Switch is more general.

It is actually important to use  ``linker='vm'`` or ``linker='cvm'``,
otherwise IfElse will compute both variables and take the same computation
time as the Switch Op. The linker is not currently set by default to 'cvm' but
it will be in a near future.

Loops
-----

**Scan**

- General form of **recurrence**, which can be used for looping.
- **Reduction** and **map** (loop over the leading dimensions) are special cases of Scan
- You 'scan' a function along some input sequence, producing an output at each time-step
- The function can see the **previous K time-steps** of your function
- ``sum()`` could be computed by scanning the z + x(i) function over a list, given an initial state of ``z=0``.
- Often a for-loop can be expressed as a ``scan()`` operation, and ``scan`` is the closest that Theano comes to looping.
- The advantage of using ``scan`` over for loops

  - The number of iterations to be part of the symbolic graph
  - Minimizes GPU transfers if GPU is involved
  - Compute gradients through sequential steps
  - Slightly faster then using a for loop in Python with a compiled Theano function
  - Can lower the overall memory usage by detecting the actual amount of memory needed

**Scan Example: Computing pow(A,k)**

.. code-block:: python

  import theano
  import theano.tensor as T

  k = T.iscalar("k"); A = T.vector("A")

  def inner_fct(prior_result, A): return prior_result * A
  # Symbolic description of the result
  result, updates = theano.scan(fn=inner_fct,
                              outputs_info=T.ones_like(A),
                              non_sequences=A, n_steps=k)

  # Scan has provided us with A**1 through A**k.  Keep only the last
  # value. Scan notices this and does not waste memory saving them.
  final_result = result[-1]

  power = theano.function(inputs=[A,k], outputs=final_result,
                        updates=updates)

  print power(range(10),2)
  #[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]


**Scan Example: Calculating a Polynomial**

.. testcode::

  import numpy
  import theano
  import theano.tensor as T

  coefficients = theano.tensor.vector("coefficients")
  x = T.scalar("x"); max_coefficients_supported = 10000

  # Generate the components of the polynomial
  full_range=theano.tensor.arange(max_coefficients_supported)
  components, updates = theano.scan(fn=lambda coeff, power, free_var:
                                     coeff * (free_var ** power),
                                  outputs_info=None,
                                  sequences=[coefficients, full_range],
                                  non_sequences=x)
  polynomial = components.sum()
  calculate_polynomial = theano.function(inputs=[coefficients, x],
                                       outputs=polynomial)

  test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)
  print(calculate_polynomial(test_coeff, 3))

.. testoutput::

  19.0



Exercise 4
-----------

- Run both examples
- Modify and execute the polynomial example to have the reduction done by scan



Compilation pipeline
--------------------

.. image:: ../hpcs2011_tutorial/pics/pipeline.png
   :width: 400 px

Inplace optimization
--------------------

- 2 type of inplace operations:

  - An op that return a view on its inputs (e.g. reshape, inplace transpose)
  - An op that write the output on the inputs memory space

- This allows some memory optimization
- The Op must tell Theano if they work inplace
- Inplace Op add constraints to the order of execution


Profiling
---------

- To replace the default mode with this mode, use the Theano flags ``profile=True``

- To enable the memory profiling use the flags ``profile=True,profile_memory=True``

Theano output for running the train function of logistic regression
example from :doc:`here <../tutorial/examples>` for one epoch:

.. code-block:: python

    """
    Function profiling
    ==================
      Message: train.py:47
      Time in 1 calls to Function.__call__: 5.981922e-03s
      Time in Function.fn.__call__: 5.180120e-03s (86.596%)
      Time in thunks: 4.213095e-03s (70.430%)
      Total compile time: 3.739440e-01s
        Number of Apply nodes: 21
        Theano Optimizer time: 3.258998e-01s
           Theano validate time: 5.632162e-03s
        Theano Linker time (includes C, CUDA code generation/compiling): 3.185582e-02s
           Import time 3.157377e-03s

    Time in all call to theano.grad() 2.997899e-02s
    Time since theano import 3.616s
    Class
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
      50.6%    50.6%       0.002s       1.07e-03s     Py       2       2   theano.tensor.basic.Dot
      27.2%    77.8%       0.001s       5.74e-04s     C        2       2   theano.sandbox.cuda.basic_ops.HostFromGpu
      18.1%    95.9%       0.001s       3.81e-04s     C        2       2   theano.sandbox.cuda.basic_ops.GpuFromHost
       2.6%    98.6%       0.000s       1.23e-05s     C        9       9   theano.tensor.elemwise.Elemwise
       0.8%    99.3%       0.000s       3.29e-05s     C        1       1   theano.sandbox.cuda.basic_ops.GpuElemwise
       0.3%    99.6%       0.000s       5.60e-06s     C        2       2   theano.tensor.elemwise.DimShuffle
       0.2%    99.8%       0.000s       6.91e-06s     C        1       1   theano.sandbox.cuda.basic_ops.GpuDimShuffle
       0.1%    99.9%       0.000s       5.01e-06s     C        1       1   theano.compile.ops.Shape_i
       0.1%   100.0%       0.000s       5.01e-06s     C        1       1   theano.tensor.elemwise.Sum
       ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

    Ops
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
      50.6%    50.6%       0.002s       1.07e-03s     Py       2        2   dot
      27.2%    77.8%       0.001s       5.74e-04s     C        2        2   HostFromGpu
      18.1%    95.9%       0.001s       3.81e-04s     C        2        2   GpuFromHost
       1.0%    97.0%       0.000s       4.39e-05s     C        1        1   Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}
       0.8%    97.7%       0.000s       3.29e-05s     C        1        1   GpuElemwise{Sub}[(0, 1)]
       0.4%    98.1%       0.000s       1.50e-05s     C        1        1   Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]
       0.3%    98.4%       0.000s       5.60e-06s     C        2        2   InplaceDimShuffle{x}
       0.3%    98.6%       0.000s       1.10e-05s     C        1        1   Elemwise{ScalarSigmoid}[(0, 0)]
       0.2%    98.8%       0.000s       9.06e-06s     C        1        1   Elemwise{Composite{(i0 - (i1 * (i2 + (i3 * i0))))}}[(0, 0)]
       0.2%    99.0%       0.000s       7.15e-06s     C        1        1   Elemwise{gt,no_inplace}
       0.2%    99.2%       0.000s       6.91e-06s     C        1        1   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
       0.2%    99.3%       0.000s       6.91e-06s     C        1        1   GpuDimShuffle{1,0}
       0.2%    99.5%       0.000s       6.91e-06s     C        1        1   Elemwise{neg,no_inplace}
       0.1%    99.6%       0.000s       5.96e-06s     C        1        1   Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
       0.1%    99.8%       0.000s       5.01e-06s     C        1        1   Elemwise{Cast{float64}}
       0.1%    99.9%       0.000s       5.01e-06s     C        1        1   Shape_i{0}
       0.1%   100.0%       0.000s       5.01e-06s     C        1        1   Sum{acc_dtype=float64}
       ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

    Apply
    ------
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
      26.8%    26.8%       0.001s       1.13e-03s      1     1                     dot(x, w)
        input 0: dtype=float32, shape=(400, 784), strides=c
        input 1: dtype=float64, shape=(784,), strides=c
        output 0: dtype=float64, shape=(400,), strides=c
      26.5%    53.4%       0.001s       1.12e-03s      1    10                     HostFromGpu(GpuDimShuffle{1,0}.0)
        input 0: dtype=float32, shape=(784, 400), strides=(1, 784)
        output 0: dtype=float32, shape=(784, 400), strides=c
      23.8%    77.1%       0.001s       1.00e-03s      1    18                     dot(x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
        input 0: dtype=float32, shape=(784, 400), strides=c
        input 1: dtype=float64, shape=(400,), strides=c
        output 0: dtype=float64, shape=(784,), strides=c
       9.6%    86.7%       0.000s       4.04e-04s      1     3                     GpuFromHost(y)
        input 0: dtype=float32, shape=(400,), strides=c
        output 0: dtype=float32, shape=(400,), strides=(1,)
       8.5%    95.2%       0.000s       3.58e-04s      1     2                     GpuFromHost(x)
        input 0: dtype=float32, shape=(400, 784), strides=c
        output 0: dtype=float32, shape=(400, 784), strides=(784, 1)
       1.0%    96.3%       0.000s       4.39e-05s      1    13                     Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}(y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, HostFromGpu.0, Elemwise{neg,no_inplace}.0)
        input 0: dtype=float32, shape=(400,), strides=c
        input 1: dtype=float64, shape=(400,), strides=c
        input 2: dtype=float64, shape=(1,), strides=c
        input 3: dtype=float32, shape=(400,), strides=c
        input 4: dtype=float64, shape=(400,), strides=c
        output 0: dtype=float64, shape=(400,), strides=c
       0.8%    97.1%       0.000s       3.29e-05s      1     7                     GpuElemwise{Sub}[(0, 1)](CudaNdarrayConstant{[ 1.]}, GpuFromHost.0)
        input 0: dtype=float32, shape=(1,), strides=c
        input 1: dtype=float32, shape=(400,), strides=(1,)
        output 0: dtype=float32, shape=(400,), strides=c
       0.7%    97.7%       0.000s       2.91e-05s      1    11                     HostFromGpu(GpuElemwise{Sub}[(0, 1)].0)
        input 0: dtype=float32, shape=(400,), strides=c
        output 0: dtype=float32, shape=(400,), strides=c
       0.4%    98.1%       0.000s       1.50e-05s      1    15                     Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float64}}.0, Elemwise{ScalarSigmoid}[(0, 0)].0, HostFromGpu.0)
        input 0: dtype=float64, shape=(400,), strides=c
        input 1: dtype=float64, shape=(1,), strides=c
        input 2: dtype=float32, shape=(400,), strides=c
        input 3: dtype=float64, shape=(1,), strides=c
        input 4: dtype=float64, shape=(400,), strides=c
        input 5: dtype=float32, shape=(400,), strides=c
        output 0: dtype=float64, shape=(400,), strides=c
       0.3%    98.4%       0.000s       1.10e-05s      1    14                     Elemwise{ScalarSigmoid}[(0, 0)](Elemwise{neg,no_inplace}.0)
        input 0: dtype=float64, shape=(400,), strides=c
        output 0: dtype=float64, shape=(400,), strides=c
       0.2%    98.6%       0.000s       9.06e-06s      1    20                     Elemwise{Composite{(i0 - (i1 * (i2 + (i3 * i0))))}}[(0, 0)](w, TensorConstant{(1,) of 0...0000000149}, dot.0, TensorConstant{(1,) of 0...9999999553})
        input 0: dtype=float64, shape=(784,), strides=c
        input 1: dtype=float64, shape=(1,), strides=c
        input 2: dtype=float64, shape=(784,), strides=c
        input 3: dtype=float64, shape=(1,), strides=c
        output 0: dtype=float64, shape=(784,), strides=c
       0.2%    98.7%       0.000s       7.15e-06s      1    16                     Elemwise{gt,no_inplace}(Elemwise{ScalarSigmoid}[(0, 0)].0, TensorConstant{(1,) of 0.5})
        input 0: dtype=float64, shape=(400,), strides=c
        input 1: dtype=float32, shape=(1,), strides=c
        output 0: dtype=int8, shape=(400,), strides=c
       0.2%    98.9%       0.000s       7.15e-06s      1     0                     InplaceDimShuffle{x}(b)
        input 0: dtype=float64, shape=(), strides=c
        output 0: dtype=float64, shape=(1,), strides=c
       0.2%    99.1%       0.000s       6.91e-06s      1    19                     Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.10000000149}, Sum{acc_dtype=float64}.0)
        input 0: dtype=float64, shape=(), strides=c
        input 1: dtype=float64, shape=(), strides=c
        input 2: dtype=float64, shape=(), strides=c
        output 0: dtype=float64, shape=(), strides=c
       0.2%    99.2%       0.000s       6.91e-06s      1     9                     Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
        input 0: dtype=float64, shape=(400,), strides=c
        output 0: dtype=float64, shape=(400,), strides=c
       0.2%    99.4%       0.000s       6.91e-06s      1     6                     GpuDimShuffle{1,0}(GpuFromHost.0)
        input 0: dtype=float32, shape=(400, 784), strides=(784, 1)
        output 0: dtype=float32, shape=(784, 400), strides=(1, 784)
       0.1%    99.5%       0.000s       5.96e-06s      1     5                     Elemwise{Composite{((-i0) - i1)}}[(0, 0)](dot.0, InplaceDimShuffle{x}.0)
        input 0: dtype=float64, shape=(400,), strides=c
        input 1: dtype=float64, shape=(1,), strides=c
        output 0: dtype=float64, shape=(400,), strides=c
       0.1%    99.7%       0.000s       5.01e-06s      1    17                     Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
        input 0: dtype=float64, shape=(400,), strides=c
        output 0: dtype=float64, shape=(), strides=c
       0.1%    99.8%       0.000s       5.01e-06s      1    12                     Elemwise{Cast{float64}}(InplaceDimShuffle{x}.0)
        input 0: dtype=int64, shape=(1,), strides=c
        output 0: dtype=float64, shape=(1,), strides=c
       0.1%    99.9%       0.000s       5.01e-06s      1     4                     Shape_i{0}(y)
        input 0: dtype=float32, shape=(400,), strides=c
        output 0: dtype=int64, shape=(), strides=c
       ... (remaining 1 Apply instances account for 0.10%(0.00s) of the runtime)

    Memory Profile
    (Sparse variables are ignored)
    (For values in brackets, it's for linker = c|py
    ---
        Max if no gc (allow_gc=False): 2469KB (2469KB)
        CPU: 1242KB (1242KB)
        GPU: 1227KB (1227KB)
    ---
        Max if linker=cvm(default): 2466KB (2464KB)
        CPU: 1241KB (1238KB)
        GPU: 1225KB (1227KB)
    ---
        Memory saved if views are used: 1225KB (1225KB)
        Memory saved if inplace ops are used: 17KB (17KB)
        Memory saved if gc is enabled: 3KB (4KB)
    ---

        <Sum apply outputs (bytes)> <Apply outputs shape> <created/inplace/view> <Apply node>

           1254400B  [(400, 784)] c GpuFromHost(x)
           1254400B  [(784, 400)] v GpuDimShuffle{1,0}(GpuFromHost.0)
           1254400B  [(784, 400)] c HostFromGpu(GpuDimShuffle{1,0}.0)
              6272B  [(784,)] c dot(x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)].0)
              6272B  [(784,)] i Elemwise{Composite{(i0 - (i1 * (i2 + (i3 * i0))))}}[(0, 0)](w, TensorConstant{(1,) of 0...0000000149}, dot.0, TensorConstant{(1,) of 0...9999999553})
              3200B  [(400,)] c dot(x, w)
              3200B  [(400,)] i Elemwise{Composite{((-i0) - i1)}}[(0, 0)](dot.0, InplaceDimShuffle{x}.0)
              3200B  [(400,)] i Elemwise{ScalarSigmoid}[(0, 0)](Elemwise{neg,no_inplace}.0)
              3200B  [(400,)] c Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
              3200B  [(400,)] i Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float64}}.0, Elemwise{ScalarSigmoid}[(0, 0)].0, HostFromGpu.0)
              3200B  [(400,)] c Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}(y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, HostFromGpu.0, Elemwise{neg,no_inplace}.0)
              1600B  [(400,)] i GpuElemwise{Sub}[(0, 1)](CudaNdarrayConstant{[ 1.]}, GpuFromHost.0)
              1600B  [(400,)] c HostFromGpu(GpuElemwise{Sub}[(0, 1)].0)
              1600B  [(400,)] c GpuFromHost(y)
       ... (remaining 7 Apply account for  448B/3800192B ((0.01%)) of the Apply with dense outputs sizes)

        <created/inplace/view> is taken from the Op's declaration.
        Apply nodes marked 'inplace' or 'view' may actually allocate memory, this is not reported here. If you use DebugMode, warnings will be emitted in those cases.

    Here are tips to potentially make your code run faster
                     (if you think of new ones, suggest them on the mailing list).
                     Test them first, as they are not guaranteed to always provide a speedup.
      Sorry, no tip for today.
    """

Exercise 5
-----------

- In the last exercises, do you see a speed up with the GPU?
- Where does it come from? (Use profile=True)
- Is there something we can do to speed up the GPU version?


Printing/Drawing Theano graphs
------------------------------

Consider the following logistic regression model:

>>> import numpy
>>> import theano
>>> import theano.tensor as T
>>> rng = numpy.random
>>> # Training data
>>> N = 400
>>> feats = 784
>>> D = (rng.randn(N, feats).astype(theano.config.floatX), rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
>>> training_steps = 10000
>>> # Declare Theano symbolic variables
>>> x = T.matrix("x")
>>> y = T.vector("y")
>>> w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
>>> b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
>>> x.tag.test_value = D[0]
>>> y.tag.test_value = D[1]
>>> # Construct Theano expression graph
>>> p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
>>> prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
>>> # Compute gradients
>>> xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
>>> cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
>>> gw,gb = T.grad(cost, [w,b])
>>> # Training and prediction function
>>> train = theano.function(inputs=[x,y], outputs=[prediction, xent], updates=[[w, w-0.01*gw], [b, b-0.01*gb]], name = "train")
>>> predict = theano.function(inputs=[x], outputs=prediction, name = "predict")

We will now make use of Theano's printing features to compare the unoptimized
graph (``prediction``) to the optimized graph (``predict``).


Pretty Printing
~~~~~~~~~~~~~~~

>>> theano.printing.pprint(prediction) # doctest: +NORMALIZE_WHITESPACE
'gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b)))), TensorConstant{0.5})'


Debug Print
~~~~~~~~~~~

The graph before optimization:

>>> theano.printing.debugprint(prediction) # doctest: +NORMALIZE_WHITESPACE, +SKIP
    Elemwise{gt,no_inplace} [@A] ''
    |Elemwise{true_div,no_inplace} [@B] ''
    | |DimShuffle{x} [@C] ''
    | | |TensorConstant{1} [@D]
    | |Elemwise{add,no_inplace} [@E] ''
    |   |DimShuffle{x} [@F] ''
    |   | |TensorConstant{1} [@D]
    |   |Elemwise{exp,no_inplace} [@G] ''
    |     |Elemwise{sub,no_inplace} [@H] ''
    |       |Elemwise{neg,no_inplace} [@I] ''
    |       | |dot [@J] ''
    |       |   |x [@K]
    |       |   |w [@L]
    |       |DimShuffle{x} [@M] ''
    |         |b [@N]
    |DimShuffle{x} [@O] ''
      |TensorConstant{0.5} [@P]

The graph after optimization:

>>> theano.printing.debugprint(predict) # doctest: +NORMALIZE_WHITESPACE, +SKIP
    Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}} [@A] ''   4
     |CGemv{inplace} [@B] ''   3
     | |Alloc [@C] ''   2
     | | |TensorConstant{0.0} [@D]
     | | |Shape_i{0} [@E] ''   1
     | |   |x [@F]
     | |TensorConstant{1.0} [@G]
     | |x [@F]
     | |w [@H]
     | |TensorConstant{0.0} [@D]
     |InplaceDimShuffle{x} [@I] ''   0
     | |b [@J]
     |TensorConstant{(1,) of 0.5} [@K]


Picture Printing of Graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~
``pydotprint`` requires graphviz and pydot.

The graph before optimization:

>>> theano.printing.pydotprint(prediction, outfile="pics/logreg_pydotprint_prediction.png", var_with_name_simple=True)  # doctest: +SKIP
The output file is available at pics/logreg_pydotprint_prediction.png

.. image:: ./pics/logreg_pydotprint_prediction.png
   :width: 800 px

The graph after optimization:

>>> theano.printing.pydotprint(predict, outfile="pics/logreg_pydotprint_predict.png", var_with_name_simple=True)  # doctest: +SKIP
The output file is available at pics/logreg_pydotprint_predict.png

.. image:: ./pics/logreg_pydotprint_predict.png
   :width: 800 px

The optimized training graph:

>>> theano.printing.pydotprint(train, outfile="pics/logreg_pydotprint_train.png", var_with_name_simple=True)  # doctest: +SKIP
The output file is available at pics/logreg_pydotprint_train.png

.. image:: ./pics/logreg_pydotprint_train.png
   :width: 1500 px


Debugging
---------

- Run with the flag ``mode=DebugMode``

  - 100-1000x slower
  - Test all optimization steps from the original graph to the final graph
  - Checks many things that Op should/shouldn't do
  - Executes both the Python and C code versions

- Run with the Theano flag ``compute_test_value = {``off'',``ignore'', ``warn'', ``raise''}``

  - Run the code as we create the graph
  - Allows you to find the bug earlier (ex: shape mismatch)
  - Makes it easier to identify where the problem is in *your* code
  - Use the value of constants and shared variables directly
  - For pure symbolic variables uses ``x.tag.test_value = numpy.random.rand(5,10)``

- Run with the flag ``mode=FAST_COMPILE``

  - Few optimizations
  - Run Python code (better error messages and can be debugged interactively in the Python debugger)

Known limitations
-----------------

- Compilation phase distinct from execution phase
- Compilation time can be significant

  - Amortize it with functions over big input or reuse functions

- Execution overhead

  - Needs a certain number of operations to be useful
  - We have started working on this in a branch

- Compilation time superlinear in the size of the graph.

  - A few hundreds nodes is fine
  - Disabling a few optimizations can speed up compilation
  - Usually too many nodes indicates a problem with the graph
