feed-forward-neural-network-tensorflow slides

Feed Forward Neural Networks¶

In [2]:

from IPython.display import Image
Image(filename='./pics/Deeplearning 2.png')

Out[2]:

A layer of neurons:

Input: $\vec x^T = (x_1, x_2, \dots, x_n)$

Output of the $j$-th neuron: $$ h_j = \sigma(\sum_{i=1}^n w_{ij} x_i + b_j) $$

With the weight matrix $W$:

$$ \vec h = \sigma(\vec x \cdot W + \vec b) $$

The function $\sigma$ is applied elementwise to the vector components.
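
The two formulas for a layer can be checked against each other numerically. The following is a minimal NumPy sketch, not part of the original notebook; the logistic function as $\sigma$ and the concrete values of `x`, `W` and `b` are assumptions chosen for illustration.

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def layer(x, W, b, sigma=logistic):
    """Activations of one layer: h = sigma(x . W + b)."""
    return sigma(x @ W + b)

# Illustrative values (assumed): n = 3 inputs, 2 neurons.
x = np.array([1.0, 2.0, 3.0])
W = np.array([[ 0.1, -0.2],
              [ 0.0,  0.3],
              [-0.1,  0.1]])   # W[i, j]: weight from input i to neuron j
b = np.array([0.2, -0.7])

# Matrix form.
h = layer(x, W, b)

# Same result neuron by neuron, following the sum formula for h_j.
h_sum = np.array([logistic(sum(W[i, j] * x[i] for i in range(3)) + b[j])
                  for j in range(2)])
```

Both `h` and `h_sum` hold one activation per neuron; the matrix form simply computes all sums at once.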

Training data: not linearly separable¶

In [6]:

plot_train_data(X_train[t_train==0], X_train[t_train==1])

Hand-crafted feature transformation¶

$$ \phi_1(\vec x) = x_1^2 $$

$$ \phi_2(\vec x) = x_2^2 $$

In the constructed feature space the data are linearly separable.

In [8]:

phi_train = X_train**2
plot_train_transformed(phi_train[t_train==0], phi_train[t_train==1])
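
Why squaring helps can be seen on synthetic data. The sketch below does not use the notebook's `X_train`; it assumes, for illustration only, one class inside the unit disc and a second class on a ring around it. Since $\phi_1 + \phi_2 = x_1^2 + x_2^2$ is the squared radius, a straight line in feature space separates the two classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(n, r_min, r_max):
    """n points with radius in [r_min, r_max] and a random angle (hypothetical data)."""
    r = rng.uniform(r_min, r_max, n)
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.c_[r * np.cos(angle), r * np.sin(angle)]

inner = ring(100, 0.0, 1.0)   # class 0: inside the unit disc
outer = ring(100, 1.5, 2.0)   # class 1: ring around it

# Same transformation as phi_train = X_train**2 in the notebook.
phi_inner = inner**2
phi_outer = outer**2

# The line phi_1 + phi_2 = threshold separates the classes in feature space.
threshold = 1.25**2
inner_below = phi_inner.sum(axis=1) < threshold
outer_above = phi_outer.sum(axis=1) > threshold
```

In the original $(x_1, x_2)$ space no single straight line can do this, because class 1 surrounds class 0.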

Learning the transformation¶

Neural networks can learn feature transformations.

The activation vector of a hidden layer can be interpreted as a transformed input vector.

Feed-forward neural network with one hidden layer¶

Activation of the (first) hidden layer:

$$ \vec h^{(1)} = \sigma_1 \left(\vec x \cdot W^{(1)} + \vec b^{(1)} \right) $$

Activation of the output neuron $o$:

$$ o = h^{(2)}= \sigma_2 \left( \vec h^{(1)} \cdot W^{(2)} + b^{(2)} \right) $$

In [15]:

# (first) hidden layer
a = tf.matmul(x, W_h) + b_h

# activation function: rectified linear units (ReLU)
h = tf.nn.relu(a)

# output neuron:
y = logistic_function(tf.matmul(h, W_o) + b_o)
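
The same forward pass can be traced without TensorFlow. This is a NumPy sketch under assumptions: the number of hidden units (`n_hidden = 4`), the random initial weights and the batch `x_batch` are made up for illustration, and `logistic` stands in for the notebook's `logistic_function`, whose definition is not shown here.

```python
import numpy as np

def relu(a):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(a, 0.0)

def logistic(a):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_h, b_h, W_o, b_o):
    """Forward pass: hidden activations h and output y for a batch x."""
    h = relu(x @ W_h + b_h)          # hidden layer, shape (batch, n_hidden)
    y = logistic(h @ W_o + b_o)      # output neuron, shape (batch, 1)
    return h, y

rng = np.random.default_rng(0)
n_hidden = 4                         # assumed size of the hidden layer
W_h = rng.normal(scale=0.5, size=(2, n_hidden))
b_h = np.zeros(n_hidden)
W_o = rng.normal(scale=0.5, size=(n_hidden, 1))
b_o = np.zeros(1)

x_batch = rng.normal(size=(5, 2))    # five 2-d inputs (hypothetical)
h, y = forward(x_batch, W_h, b_h, W_o, b_o)
```

The output has one column, which is why the cost function later indexes it as `y[:,0]`.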

Cost function and L2 regularization¶

In [16]:

cross_entropy = -tf.reduce_sum(y_*tf.log(y[:,0]) + (1.-y_)*tf.log(1.-y[:,0]))

l2_reg = tf.reduce_sum(tf.square(W_h)) + tf.reduce_sum(tf.square(W_o))

lambda_ = 0.002
cost = cross_entropy + lambda_ * l2_reg
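
For reference, here is the same cost written out in NumPy. It is a sketch with made-up predictions, targets and weights (all values are assumptions), so that the two terms can be checked by hand.

```python
import numpy as np

def cost(y, t, weights, lambda_):
    """Binary cross-entropy summed over the batch plus an L2 penalty on the weights."""
    cross_entropy = -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
    l2_reg = sum(np.sum(W**2) for W in weights)
    return cross_entropy + lambda_ * l2_reg

# Hypothetical predictions y, targets t and weight matrices.
y = np.array([0.9, 0.2, 0.5])
t = np.array([1.0, 0.0, 1.0])
W_h = np.array([[1.0, -2.0],
                [0.0,  1.0]])
W_o = np.array([[2.0],
                [-1.0]])

total = cost(y, t, [W_h, W_o], lambda_=0.002)
# cross-entropy: -(log 0.9 + log 0.8 + log 0.5); L2 term: 0.002 * 11
```

Only the weight matrices enter the penalty; the biases are not regularized, as in the TensorFlow cell above.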

In [22]:

# Note: there is also a visualization tool called TensorBoard, not used here.
plt.plot(range(epochs), cost_, '-b')
plt.xlabel('Iterations')
plt.ylabel('Cost')
cost_

Out[22]:

array([ 44.88797379,  44.88481522,  44.88169098, ...,   6.09101915,
         6.09098625,   6.09095526])

Decision boundary¶

In [27]:

plot_contour(X_train[t_train==0], X_train[t_train==1], 'train data')

In [28]:

plot_contour(X_test[t_test==0], X_test[t_test==1], 'test data')

Fraction of correct predictions on the test data (accuracy)¶

In [29]:

y_e = y.eval(feed_dict={x: X_test})
((y_e>0.5).reshape(-1)==t_test).mean()

Out[29]:

0.84999999999999998

In [31]:

plot_hidden_neurons_activities()

Within each hatched region the number of active hidden neurons is constant.
Where the active ReLU units do not change, the hidden layer is an affine map, and two affine transformations applied in sequence are again a single affine transformation, here followed by the logistic function. That is, within each hatched region the neural network behaves like logistic regression: the decision boundary is a hyperplane.
At the borders of the hatched regions one decision hyperplane gives way to the next.

See also [Mon14, Pas14].
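
The piecewise-affine behaviour can be verified numerically. The sketch below is not from the notebook: the weights are random and the hidden layer has 5 units (both assumptions). It reads off which ReLU units are active at a point `x0`, collapses the two layers into one affine map for that region, and compares it with the network on nearby points that share the same activation pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical weights: 2 inputs, 5 hidden ReLU units, 1 output neuron.
W_h = rng.normal(size=(2, 5)); b_h = rng.normal(size=5)
W_o = rng.normal(size=(5, 1)); b_o = rng.normal(size=1)

def pre_output(x):
    """Argument of the logistic function: relu(x W_h + b_h) W_o + b_o."""
    return np.maximum(x @ W_h + b_h, 0.0) @ W_o + b_o

def effective_affine(x0):
    """Single affine map that the network implements in the region containing x0."""
    active = (x0 @ W_h + b_h > 0).astype(float)   # activation pattern at x0
    W_eff = (W_h * active) @ W_o                  # inactive units contribute nothing
    b_eff = (b_h * active) @ W_o + b_o
    return active, W_eff, b_eff

x0 = np.array([0.3, -0.2])
active, W_eff, b_eff = effective_affine(x0)

# Keep only nearby points that have the same activation pattern as x0.
neighbours = x0 + 1e-3 * rng.normal(size=(100, 2))
same_region = ((neighbours @ W_h + b_h > 0) == active.astype(bool)).all(axis=1)
pts = neighbours[same_region]
```

On `pts`, `pre_output(pts)` agrees with `pts @ W_eff + b_eff`, so the decision boundary inside the region is the hyperplane `x @ W_eff + b_eff = 0`.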

Exercise:¶

Vary the number of hidden neurons, $\lambda$, and the number of training examples.
How do underfitting and overfitting show up?

In [34]:

X = np.concatenate((x_0, x_1))
t = np.concatenate((np.zeros(len(x_0)), np.ones(len(x_1))))


coarse = np.arange(-3,4,1.)
fine = np.arange(-3,3.01,.01)
x_grid = np.meshgrid(coarse, fine)
y_grid = np.meshgrid(fine, coarse)

grid_points_x = np.concatenate((x_grid[0].reshape(-1,1), x_grid[1].reshape(-1,1)), axis=1)
grid_points_y = np.concatenate((y_grid[0].reshape(-1,1), y_grid[1].reshape(-1,1)), axis=1)
grid_points = np.concatenate((grid_points_x, grid_points_y), axis=0)

x_e = np.arange(-x_ext*0.5, x_ext*0.5, 0.01)

blue = np.concatenate((x_e.reshape(-1,1),(x_e**2 + blue_b).reshape(-1,1)), axis=1)
red = np.concatenate((x_e.reshape(-1,1),(x_e**2 + red_b).reshape(-1,1)), axis=1)

# for slides:
def plot_data():
    plt.figure(figsize=(7,7))
    plt.scatter(x_0[:,0], x_0[:,1], label='Class 0', color='b') 
    plt.scatter(x_1[:,0], x_1[:,1], label='Class 1', color='r') 
    plt.xlabel("$x_1$")
    plt.ylabel("$x_2$")

    plt.scatter(x_grid[0].reshape(-1),x_grid[1].reshape(-1), marker='.', color='k', lw=0., s=1) 
    plt.scatter(y_grid[0].reshape(-1),y_grid[1].reshape(-1), marker='.', color='k', lw=0., s=1) 

    plt.scatter(blue[:,0], blue[:,1], color='b', s=1, lw=0) 
    plt.scatter(red[:,0], red[:,1], color='r', s=1, lw=0)

Feature transformation¶

In [35]:

plot_data() # each point is a training example

In [37]:

lambda_ = .2             
epochs=4000 
starter_alpha = 0.005 

cross_entropy = -tf.reduce_sum(y_*tf.log(y[:,0]) + (1.-y_)*tf.log(1.-y[:,0]))
# penalize both weight matrices (hidden layer and output layer)
l2_reg = tf.reduce_sum(tf.square(W_h)) + tf.reduce_sum(tf.square(W_o))

cost = cross_entropy + lambda_ * l2_reg

# Decaying the learning rate:
# http://www.tensorflow.org/api_docs/python/train.html#exponential_decay
global_step = tf.Variable(0, trainable=False)
    
alpha = tf.train.exponential_decay(starter_alpha, global_step,
                                   100, 0.96, staircase=True)

optimizer = tf.train.GradientDescentOptimizer(alpha)

# Passing global_step to minimize() will increment it at each step.
train_step = optimizer.minimize(cost, global_step=global_step)

sess.run(tf.initialize_all_variables())

cost_ = np.ndarray(epochs)
for j in range(epochs):
    # full-batch learning: every step uses the complete training set
    cost_[j] = cost.eval(feed_dict={x: X, y_: t})
    train_step.run(feed_dict={x: X, y_: t})
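
The learning-rate schedule used above can be written down in a few lines. This is a plain NumPy re-implementation for illustration, not the TensorFlow function itself; it follows the documented formula `learning_rate * decay_rate ** (global_step / decay_steps)`, where `staircase=True` rounds the exponent down to an integer.

```python
import numpy as np

def exponential_decay(starter_alpha, global_step, decay_steps=100,
                      decay_rate=0.96, staircase=True):
    """Learning rate after global_step optimizer steps."""
    exponent = np.asarray(global_step) / decay_steps
    if staircase:
        exponent = np.floor(exponent)   # rate stays constant within each block of steps
    return starter_alpha * decay_rate ** exponent

steps = np.array([0, 99, 100, 250, 3999])
alphas = exponential_decay(0.005, steps)
```

With the settings of the cell above, the rate stays at 0.005 for the first 100 steps and is multiplied by 0.96 after every further 100 steps.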

In [38]:

plt.plot(range(epochs), cost_, '-b')
plt.xlabel('Iterations')
plt.ylabel('Cost')

Out[38]:

<matplotlib.text.Text at 0x11af2e9b0>

In [39]:

# training accuracy (fraction of correctly classified training examples)
y_e = y.eval(feed_dict={x: X})
((y_e>0.5).reshape(-1)==t).mean()

Out[39]:

1.0

In [40]:

def plot_hidden_space(h=h, x=x, grid_points=None, x_0=x_0, x_1=x_1, W_o=W_o, b_o = b_o, blue=None, red=None):
    w_o=W_o.eval()  
    b_o_=b_o.eval()  
    fig = plt.figure(figsize=(8,8))

    if grid_points is not None:
        grid_hidden = h.eval(feed_dict={x: grid_points})
        plt.scatter(grid_hidden[:,0], grid_hidden[:,1], marker='.', color='k', lw=0., s=2) 
    
    h_0 = h.eval(feed_dict={x: x_0})
    h_1 = h.eval(feed_dict={x: x_1})
    plt.scatter(h_0[:,0], h_0[:,1],label='Class 0', color='b') 
    plt.scatter(h_1[:,0], h_1[:,1],label='Class 1', color='r') 
    
    if (blue is not None) and (red is not None):
        blue_hidden = h.eval(feed_dict={x: blue})
        red_hidden = h.eval(feed_dict={x: red})
        plt.scatter(blue_hidden[:,0], blue_hidden[:,1], color='b', lw=0., s=2) 
        plt.scatter(red_hidden[:,0], red_hidden[:,1], color='r', lw=0., s=2) 

    # decision boundary
    plt.plot(np.arange(-1.1,1.2,0.1), - (w_o[0] * np.arange(-1.1,1.2,0.1) + b_o_)/w_o[1],'g-')

    # for tanh units- TODO
    plt.xlim(-1.1, 1.1)
    plt.ylim(-1.1, 1.1)

Transformed feature space: activations of the hidden neurons¶

Green line: decision boundary in the feature space
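
The line drawn by `plot_hidden_space` follows from setting the argument of the logistic function to zero: the output equals 0.5 exactly where $w_1 h_1 + w_2 h_2 + b = 0$, i.e. $h_2 = -(w_1 h_1 + b)/w_2$. A small NumPy check, with output weights that are assumed for illustration rather than taken from the trained network:

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical output-layer parameters for a 2-d hidden space.
w_o = np.array([[1.5], [-2.0]])
b_o = np.array([0.25])

# Same expression as in plot_hidden_space: solve w_1*h_1 + w_2*h_2 + b = 0 for h_2.
h1 = np.arange(-1.1, 1.2, 0.1)
h2 = -(w_o[0] * h1 + b_o) / w_o[1]

# Every point on the line is mapped to an output of exactly 0.5.
boundary = np.c_[h1, h2]
y_on_boundary = logistic(boundary @ w_o + b_o)
```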

In [41]:

plot_hidden_space(h=h, x=x, grid_points=grid_points, x_0=x_0, x_1=x_1, W_o=W_o, b_o = b_o, blue=blue, red=red)

Spiral data with three classes¶

In [68]:

# spiral data
# adapted from http://cs231n.github.io/neural-networks-case-study/

N = 50 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
f = 1

X = np.zeros((N*K,D)) # data matrix (each row = single example)
targets= np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
  ix = range(N*j, N*(j+1))
  r = np.linspace(0.1, 1., N) # radius
  tt = np.linspace(j*2,(j+2)*2,N) + np.random.randn(N)*0.02 # theta
  X[ix] = np.c_[r*np.sin(tt*f), r*np.cos(tt*f)]
  targets[ix] = j

coarse = np.arange(-1.,1.1,.5)
fine = np.arange(-1,1.01,.01)

x_grid = np.meshgrid(coarse, fine)
y_grid = np.meshgrid(fine, coarse)
grid_points_x = np.concatenate((x_grid[0].reshape(-1,1), x_grid[1].reshape(-1,1)), axis=1)
grid_points_y = np.concatenate((y_grid[0].reshape(-1,1), y_grid[1].reshape(-1,1)), axis=1)
grid_points = np.concatenate((grid_points_x, grid_points_y), axis=0)

def plot_spiral_data():
    plt.figure(figsize=(8,8))
    plt.scatter(X[:, 0], X[:, 1], c=targets, s=40, cmap=plt.cm.Spectral)
    plt.scatter(x_grid[0].reshape(-1),x_grid[1].reshape(-1), marker='.', color='k', lw=0., s=2) 
    plt.scatter(y_grid[0].reshape(-1),y_grid[1].reshape(-1), marker='.', color='k', lw=0., s=2)

In [69]:

plot_spiral_data()

Feed-forward neural network with two hidden layers¶

Activations of the hidden layers:

$$ \vec h^{(1)} = \sigma_1 \left(\vec x \cdot W^{(1)} + \vec b^{(1)} \right) $$

$$ \vec h^{(2)}= \sigma_2 \left( \vec h^{(1)} \cdot W^{(2)} + \vec b^{(2)} \right) $$

Activation of the output neurons $\vec o$:

$$ \vec o = \vec h^{(3)}= \sigma_3 \left( \vec h^{(2)} \cdot W^{(3)} + \vec b^{(3)} \right) $$

Softmax¶

$$ \sigma_3 (\vec a) = \frac{\exp(\vec a)}{\sum_i\exp(a_i)} $$

where

$\exp(\vec a)$ is applied elementwise, so the result is a vector of the same dimensionality.
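
A direct NumPy version of the softmax formula is sketched below. Subtracting the maximum before exponentiating is a common numerical safeguard that is not part of the formula above; it leaves the result unchanged because the factor $\exp(-\max_i a_i)$ cancels between numerator and denominator. The input vector is an arbitrary example.

```python
import numpy as np

def softmax(a):
    """Softmax along the last axis; the shift by the maximum avoids overflow in exp."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

a = np.array([1.0, 2.0, 3.0])      # example pre-activations of three output neurons
p = softmax(a)

# Direct evaluation of the formula, for comparison.
p_direct = np.exp(a) / np.exp(a).sum()
```

The outputs are positive and sum to one, so they can be read as class probabilities.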

In [72]:

lambda_ = 0.0001          
epochs=8000 
starter_alpha = 0.001

# cross entropy for one-hot encoding:
cross_entropy = -tf.reduce_sum(y_*tf.log(y))

l2_reg = tf.reduce_sum(tf.square(W_h2)) + tf.reduce_sum(tf.square(W_h)) + tf.reduce_sum(tf.square(W_o))

cost = cross_entropy + lambda_ * l2_reg

# Decaying the learning rate:
# http://www.tensorflow.org/api_docs/python/train.html#exponential_decay
global_step = tf.Variable(0, trainable=False)
    
alpha = tf.train.exponential_decay(starter_alpha, global_step,
                                   100, 0.96, staircase=True)

optimizer = tf.train.GradientDescentOptimizer(alpha)

# Passing global_step to minimize() will increment it at each step.
train_step = optimizer.minimize(cost, global_step=global_step)

sess.run(tf.initialize_all_variables())

cost_ = np.ndarray(epochs)
for j in range(epochs):
    # full-batch learning: every step uses the complete training set
    cost_[j] = cost.eval(feed_dict={x: X, y_: t})
    train_step.run(feed_dict={x: X, y_: t})
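
The cross-entropy `-tf.reduce_sum(y_*tf.log(y))` expects the targets as one-hot rows with one column per class. How integer class labels such as `targets` can be brought into that form is sketched below in NumPy; the helper names and the example predictions are assumptions for illustration.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot rows."""
    return np.eye(n_classes)[labels]

def cross_entropy(y, t_one_hot):
    """Multi-class cross-entropy summed over the batch."""
    return -np.sum(t_one_hot * np.log(y))

labels = np.array([0, 2, 1])                 # hypothetical class labels
t_one_hot = one_hot(labels, n_classes=3)

# Hypothetical softmax outputs, one row per example.
y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.3, 0.4, 0.3]])

ce = cross_entropy(y, t_one_hot)
# only the probability of the correct class contributes: -(log 0.7 + log 0.8 + log 0.4)
```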

In [73]:

plt.plot(range(epochs), cost_, '-b')
plt.xlabel('Iterations')
plt.ylabel('Cost')

Out[73]:

<matplotlib.text.Text at 0x11b37ea20>

In [75]:

# as function for slides
def plot_spriral_db():    
    fig = plt.figure(figsize=(8,8))

    # decision boundary
    #plt.plot(np.arange(-1.1,1.2,0.1), - (w_o[0] * np.arange(-1.1,1.2,0.1) + b_o_)/w_o[1],'g-')

    plt.xlim(-1.1, 1.1)
    plt.ylim(-1.1, 1.1)

    delta = 0.01
    a = np.arange(-1.1, 1.1+delta, delta)
    b = np.arange(-1.1, 1.1+delta, delta)
    A, B = np.meshgrid(a, b)

    x_ = np.dstack((A, B)).reshape(-1, 2)

    out = y.eval(feed_dict={x: x_})

    ns = list()
    ns.append(3)
    ns.extend(A.shape)
    out = out.T.reshape(ns)

    plt.pcolor(A, B, out[0], cmap="Blues", alpha=0.2)
    plt.pcolor(A, B, out[1], cmap=('Oranges'), alpha=0.2)
    plt.pcolor(A, B, out[2], cmap=('Greens'), alpha=0.2)
    #out.shape
    # lets visualize the data:
    plt.scatter(X[:, 0], X[:, 1], c=targets, s=40, cmap=plt.cm.Spectral)

    plt.title("Spiral data and decision boundaries in (raw) data space.")

In [76]:

plot_spriral_db()

In [78]:

# as function for slides
def plot_spiral_hdb():
    fig = plt.figure(figsize=(8,8))

    plt.xlim(-1.1, 1.1)
    plt.ylim(-1.1, 1.1)

    h_ = h2.eval({x:X})
    plt.scatter(h_[:, 0], h_[:, 1], c=targets, s=40, cmap=plt.cm.Spectral, lw=1)
 
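    # Grid over the 2-d hidden space. A, B, x_ and ns are local to
    # plot_spriral_db, so they are rebuilt here in the same way.
    delta = 0.01
    a = np.arange(-1.1, 1.1+delta, delta)
    A, B = np.meshgrid(a, a)
    x_ = np.dstack((A, B)).reshape(-1, 2)
    ns = [3] + list(A.shape)

    # output-layer pre-activations for every grid point
    Wo = W_o.eval()
    bo = b_o.eval()
    out = np.dot(x_, Wo) + bo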

    out = out.T.argmax(axis=0).reshape(-1,1)
    ohe = sklearn.preprocessing.OneHotEncoder()
    out = ohe.fit_transform(out).toarray()
    out = out.T.reshape(ns)

    plt.pcolor(A, B, out[0], cmap="Blues", alpha=0.2)
    plt.pcolor(A, B, out[1], cmap=('Oranges'), alpha=0.2)
    plt.pcolor(A, B, out[2], cmap=('Greens'), alpha=0.2)


    grid_hidden = h2.eval(feed_dict={x: grid_points})
    plt.scatter(grid_hidden[:,0], grid_hidden[:,1], marker='.', color='k', lw=0., s=5) 
    

    plt.title("In the last hidden layer the data points are linearly separable")

In [79]:

plot_spiral_hdb()

Exercises¶

Vary the number of neurons and observe how the decision boundaries in data space change.
Explain the behavior.

The curse of dimensionality¶

In [84]:

from IPython.display import Image
Image(filename='./pics/Deeplearning 7.png')

Out[84]:

"Real-world data" for "AI tasks" is high-dimensional.
Volume grows exponentially, as $r^n$, with the number of dimensions $n$.
Why is an amount of training data that grows exponentially with $n$ not required here?
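
The exponential growth is easy to make concrete. As a sketch, assume each axis is divided into 10 cells (an arbitrary choice for illustration); covering the space so that every cell contains at least one training example then requires $10^n$ examples.

```python
def n_grid_cells(cells_per_axis, n_dims):
    """Number of cells of a regular grid with cells_per_axis cells along each of n_dims axes."""
    return cells_per_axis ** n_dims

# Assumption for illustration: 10 cells per axis.
counts = {n: n_grid_cells(10, n) for n in (1, 2, 3, 10, 100)}

for n, c in counts.items():
    print(f"n = {n:3d}: {c:.1e} cells")
```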

Manifold Hypothesis¶

The training data occupy only a few regions of the space, concentrated on non-linear manifolds.
Reason: strong statistical dependence between the variables.
The manifolds are strongly curved in the embedding space
- e.g. translating the same image by one pixel moves it to a completely different region of the embedding data space
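
The pixel-shift example can be reproduced with a toy image. The sketch below assumes an 8x8 image of vertical stripes that are one pixel wide (a deliberately extreme, hypothetical case) and measures the Euclidean distance between the image and its one-pixel translation.

```python
import numpy as np

size = 8
columns = np.arange(size)

# Vertical stripes, one pixel wide, with pixel values 0 and 1.
image = np.tile((columns % 2).astype(float), (size, 1))

# Translate by one pixel to the right (with wrap-around at the border).
shifted = np.roll(image, 1, axis=1)

distance = np.linalg.norm(image - shifted)

# Largest possible distance between two images with pixel values in [0, 1].
max_distance = np.sqrt(image.size)
```

Although both images show the same pattern, every pixel changes, so the shifted image is as far away in pixel space as any image can be.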

In [86]:

from IPython.display import Image
Image(filename='./pics/picture-vs-random-pixels.png')

Out[86]:

"Deep learning" is "feature learning"¶

If an algorithm or neural network without sufficient depth is chosen, the layers can require exponentially more units than a sufficiently deep architecture would need (see [Ben09, Eld16, Tel16]).
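
A commonly used illustration of this effect, in the spirit of the depth results cited above but not taken from this notebook, is the repeated tent map. One layer with two ReLU units computes a tent on $[0, 1]$; composing `depth` such layers yields a sawtooth with $2^{\text{depth}}$ linear pieces. A network with a single hidden ReLU layer of $m$ units on a scalar input has at most $m + 1$ linear pieces, so it would need at least $2^{\text{depth}} - 1$ units for the same function.

```python
import numpy as np

def relu(a):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(a, 0.0)

def tent(x):
    """One layer with two ReLU units: 2x on [0, 0.5] and 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Apply the tent layer depth times."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(values):
    """Pieces of the sawtooth = number of sign changes of the slope + 1."""
    slope_sign = np.sign(np.diff(values))
    return int(np.sum(slope_sign[1:] != slope_sign[:-1])) + 1

# Dyadic grid, so that every breakpoint of the sawtooth is a grid point.
x = np.linspace(0.0, 1.0, 2**12 + 1)
pieces = {depth: count_linear_pieces(deep_sawtooth(x, depth))
          for depth in range(1, 7)}
```

The deep network uses only `2 * depth` hidden units in total, while the number of linear pieces doubles with every layer.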

In [87]:

from IPython.display import Image
Image(filename='./pics/traditional-ml.png') 
# adapted from [LeCun]

Out[87]:

In [88]:

from IPython.display import Image
Image(filename='./pics/tradional-imageprocessing-speech-approach.png') 
# adapted from [LeCun]

Out[88]:

In [89]:

from IPython.display import Image
Image(filename='./pics/trainable-feature-transformations.png') 
# adapted from [LeCun]

Out[89]:

Courses and further readings:

Colah's blog: Neural Networks, Manifolds, and Topology
G. Hinton: Neural Networks for Machine Learning
[LeCun] Yann LeCun: Feature and Representation Learning

Scientific Papers:

[Ben09] Yoshua Bengio: Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), pp. 1-127, 2009.
[Mon14] G. Montufar, R. Pascanu, K. Cho, and Y. Bengio: On the number of linear regions of deep neural networks, NIPS 2014.
[Pas14] R. Pascanu, G. Montufar, and Y. Bengio: On the Number of Response Regions of Deep Feedforward Networks with Piecewise Linear Activations, Second International Conference on Learning Representations (ICLR 2014).
[Eld16] Ronen Eldan, Ohad Shamir: The Power of Depth for Feedforward Neural Networks, COLT 2016.
[Tel16] Matus Telgarsky: Benefits of depth in neural networks, JMLR: Workshop and Conference Proceedings, vol. 49, pp. 1-23, 2016.