2019-04-15

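An Introduction to Decision Boundaries and Neural Networks

Machine Learning

This is チナパ from Goalist! This time the article is not a technical one. It is meant to build an intuition for what is actually happening when you hear things like "let's leave it to AI". (No programming involved.)

When we train a model on data and then have it classify things, the fact is that it builds some kind of rule and decides based on that rule. The more complex the problem, the harder it may be for a human to picture that rule, but we can still build an intuition for it.

f:id:c-pattamada:20190413162628j:plain — The apple is... on the left!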

Public Domain Pictures · Photography

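Let's try a simple problem

Let's start with https://playground.tensorflow.org, a place where you can experiment. Here you can play with neural networks without any programming.

f:id:c-pattamada:20190413162957p:plain

If this is your first time, you may be surprised by how much is on the screen, but we will go one step at a time!

This is our starting point. Under "Data" on the left, pick the simplest dataset, as in the screenshot. It is then reflected in "Output" on the right. (For now, also check that Noise in the left menu is at 0%.)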

f:id:c-pattamada:20190413163035p:plain

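This is a two-dimensional graph, so it has two axes. They are sometimes called the x-axis and the y-axis, but here the horizontal axis is X1 and the vertical axis is X2. These two are the quantities we measure (here they are generated randomly, but in a real situation they would be actual measurements). There are two kinds of points, orange and blue, and in this example they form two neat clusters, as you can see.

Even without a calculator, the orange points seem to average around X1=-2, X2=-2, and the blue points are concentrated around X1=2, X2=2. Now let's see how machine learning solves such a simple problem. Press the Play button near the top.

As soon as you press it, the graph is split cleanly into orange and blue. The pattern is spotted in an instant. Well, it is an easy one.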

f:id:c-pattamada:20190413163126p:plain

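The left side of the graph has turned orange and the right side blue, and a diagonal white line runs between the orange group and the blue group.

If you asked a child to "separate these two groups", they might draw a similar line. This straight white line is called the "decision boundary". Left of the white line is orange, right of it is blue. In other words, the classification is decided by this line.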
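The article itself needs no programming, but for the curious, here is a minimal Python sketch of what a straight-line decision boundary amounts to. The weights w1 and w2 and the bias b are made-up values for illustration, not numbers read from the playground.

```python
def classify(x1, x2, w1=1.0, w2=1.0, b=0.0):
    """Return the colour for the side of the line w1*x1 + w2*x2 + b = 0 the point falls on."""
    score = w1 * x1 + w2 * x2 + b
    return "blue" if score > 0 else "orange"


print(classify(2.0, 2.0))    # a point near the blue cluster
print(classify(-2.0, -2.0))  # a point near the orange cluster
```

Training the model is, roughly speaking, the search for values of w1, w2 and b that put the line between the two clusters.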

Let's cheat a little.

Even for a problem this simple, good decisions quickly become hard to make when information is missing.

f:id:c-pattamada:20190413163215p:plain

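Click the X1 variable under "Features" and its colour fades. This means that X1 is no longer passed to the machine learning model. In other words, the model has to decide whether a point is orange or blue using only X2, the vertical axis. Press Play again and look at the result.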

f:id:c-pattamada:20190413163254p:plain

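This time the graph is split horizontally. There may be mistakes too (in my case, one blue point sits on the orange background). That is because the model only receives X2, the vertical axis, and decides from that alone. Hence the horizontal line.

In other words, the decision boundary is strongly affected by the variables (features, information) being used.

So far we have only looked at easy, nearly straight-line examples, so let's raise the bar.

Enter the spiral (Naruto?)

Click X1 once more to give that information back to the model. This time, though, pick the spiral-like dataset at the bottom right of the Data section, and add some Noise as well (I used 30%, but choose whatever you like).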

f:id:c-pattamada:20190413163420p:plain

Press Play and...

f:id:c-pattamada:20190413163448p:plain

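A pale background and a white horizontal decision boundary, as if to say "I have no idea"... The pale colour means low confidence. It is like asking a student and getting an answer such as "Um... there's a bit more orange near the top, so... maybe this side is orange?"

That is because it is impossible to classify this data by drawing a single straight line. It is a spiral, after all. So how can we create a decision boundary that is not a straight line?

There is a way: include non-linear inputs among the features.

Also select sin(X1) and sin(X2) at the bottom of the Features list, and press Play on the same network.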

f:id:c-pattamada:20190413163837p:plain

All at once, the decision boundary goes wild.

f:id:c-pattamada:20190413163858p:plain

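And it is fairly accurate, too. How on earth did it figure this out? Did it realise it was looking at a "spiral"?

Analysing the inside of the neural network

Now let's look at the "hidden layer" section in the middle.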

f:id:c-pattamada:20190413164931p:plain

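This network has two hidden layers. The first has four "neurons" and the second has two. To see what each one has "noticed", hover the mouse over it and that neuron's output is shown in the graph on the right.

Yours may differ slightly, but in my case the first hidden layer seems to have made four "observations".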

f:id:c-pattamada:20190413164827p:plain

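Neurons 1 and 2 are very similar. Each seems to have a wave like a sine curve, with orange above it and blue below. That alone does not make a spiral, though.

The third "observation" is a stack of bands from top to bottom: orange at the top, then blue, then orange, then blue at the bottom. It also seems to have noticed that where blue leans towards the left, orange leans towards the right. Interesting. If you are curious how this came about, hover over the lines going into this neuron and you will see something like "weight is 〇〇".

f:id:c-pattamada:20190413164950p:plain — Screenshots taken while hovering, combined

Here the graph is built as (-0.33 x X1) + (0.3 x X2) + (sin(X1) x 1.6) + (sin(X2) x 2.7), but let's leave the maths at that.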

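The fourth one resembles X2 inverted, and has noticed that, roughly speaking, the top is orange and the bottom is blue.

These four observations are combined further to produce the outputs of the second hidden layer, and those outputs are combined once more. The result is a decision boundary that classifies the spiral-like data fairly accurately.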
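If you would like to see the weighted sum quoted above for the third neuron as code, here is a small Python sketch. The four weights are the ones from the text; the tanh activation and the missing bias term are my assumptions, so treat the output values as illustrative only.

```python
import math


def neuron_pre_activation(x1, x2):
    """Weighted sum of the four features, using the weights quoted in the text."""
    return (-0.33 * x1) + (0.3 * x2) + (1.6 * math.sin(x1)) + (2.7 * math.sin(x2))


def neuron_output(x1, x2):
    # Assumption: a tanh activation and no bias term.
    return math.tanh(neuron_pre_activation(x1, x2))


print(neuron_output(0.0, 1.5))   # towards one colour
print(neuron_output(0.0, -1.5))  # towards the other colour
```

Because sin(X2) carries the largest weight, the output flips sign several times as X2 goes from top to bottom, which matches the stacked bands described above.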

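Summary

Because we only used X1 and X2 here, everything could be shown on a neat two-dimensional graph, but more complex deep learning classification problems end up doing something similar. The more features there are, the harder the decision boundary is to draw, and that is exactly what makes this such a useful tool.

For example...

f:id:c-pattamada:20190413165056p:plain — Pretty spiral-like, isn't it?

Now go and play with the other kinds of data too! You can also add or remove hidden layers and see the effect. I hope this has given you an intuition for the kinds of decision boundaries that deep learning can create for classification.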

2019-04-11

Deep Learning Using Raw Audio Files

// Feed raw audio files directly into the deep neural network without any feature extraction. //

Conventional audio and speech analysis systems are typically built as a pipeline, where the first step is to extract various low-dimensional hand-crafted acoustic features (e.g., MFCC, pitch, RMSE, chroma, and so on).

Although hand-crafted acoustic features are typically well designed, it is still not possible to retain all useful information, because of human knowledge bias and the high compression ratio. And of course, the feature engineering you have to perform depends on the type of audio problem you are working on.

But, how about learning directly from raw waveforms (i.e., raw audio files are directly fed into the deep neural network)?

In this post, let's take learnings from this paper and try to apply it to the following Kaggle dataset.

www.kaggle.com

Go ahead and download the Heartbeat Sounds dataset. Here is what one of the sample audio files from the dataset sounds like

clyp.it

f:id:vivek081166:20190410152348p:plain

Each file in the downloaded dataset is labelled either "normal", "unlabelled", or one of several categories of abnormal heartbeat.

Our objective here is to solve the heartbeat classification problem by directly feeding raw audio files to a deep neural network without doing any hand-crafted feature extraction.

Prepare Data

Let's prepare the data to make it easily accessible to the model.

extract_class_id(): The audio file names contain their labels, so let's separate the files by name and give each one a class id. For this experiment let's treat "unlabelled" as a separate class. So, as shown above, we'll have 5 classes in total.

convert_data(): We'll normalize the raw audio data and make all audio files the same length by cutting them to 10 s; if a file is shorter than 10 s, we pad it with zeros. Finally, for each audio file, we put the class id, sampling rate, and audio data together and dump them into a .pkl file, making sure the data is properly divided into train and test sets.

gist.github.com
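To make the two steps above concrete without opening the gist, here is a rough standard-library sketch of the same ideas. It is not the code from the gist: the helper names, the label prefixes in CLASS_IDS, and the choice of peak normalization are all my assumptions for illustration.

```python
import os

# Hypothetical label prefixes; check them against the real file names in the dataset.
CLASS_IDS = {"normal": 0, "unlabelled": 1, "murmur": 2, "extrahls": 3, "artifact": 4}


def class_id_from_filename(file_path):
    """Return the class id whose label appears in the file name, or None if nothing matches."""
    name = os.path.basename(file_path).lower()
    for label, class_id in CLASS_IDS.items():
        if label in name:
            return class_id
    return None


def fix_length(samples, sample_rate, seconds=10):
    """Cut the signal to `seconds` seconds, or pad it with zeros if it is shorter."""
    target = sample_rate * seconds
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def peak_normalize(samples):
    """Scale the signal so its largest absolute value is 1 (silence is returned unchanged)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)
    return [s / peak for s in samples]
```

In practice you would do this on NumPy arrays rather than lists, but the logic stays the same: label from the name, fixed length, then normalize.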

Create and compile the model

f:id:vivek081166:20190410152941p:plain

As written in the research paper, this architecture takes input time-series waveforms, represented as a long 1D vector, instead of hand-tuned features or specially designed spectrograms.

There are many models with different complexities explained in the paper. For our experiment, we will use the m5 model.

m5 has 4 convolutional layers followed by Batch Normalization and Pooling. A Keras callback is also assigned to the model to reduce the learning rate if the accuracy does not increase over 10 epochs.

gist.github.com
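If you are wondering what "reduce the learning rate if the accuracy does not increase over 10 epochs" boils down to, here is a plain Python sketch of that logic. Keras ships this behaviour as the ReduceLROnPlateau callback; the class below is only my illustration, and the factor and min_lr defaults are made-up values, not the ones from the gist.

```python
class ReduceOnPlateau:
    """Shrink the learning rate when accuracy has not improved for `patience` epochs."""

    def __init__(self, lr, patience=10, factor=0.5, min_lr=1e-6):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.min_lr = min_lr
        self.best = float("-inf")
        self.wait = 0  # epochs since the last improvement

    def step(self, accuracy):
        """Call once per epoch with the monitored accuracy; returns the learning rate to use next."""
        if accuracy > self.best:
            self.best = accuracy
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr
```

The idea is that a smaller step size can help the optimizer settle once progress has stalled.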

Start training and see the results

Let's start training our model and see how it performs on the heartbeat sound dataset.

As per the above code, the model is set to train for 400 epochs; for me, however, the loss flattened out at around 42 epochs, and these were the results. How did yours do?

Epoch 42/400
128/832 [===>..........................] - ETA: 14s - loss: 0.0995 - acc: 0.9766
256/832 [========>.....................] - ETA: 11s - loss: 0.0915 - acc: 0.9844
384/832 [============>.................] - ETA: 9s - loss: 0.0896 - acc: 0.9844 
512/832 [=================>............] - ETA: 6s - loss: 0.0911 - acc: 0.9824
640/832 [======================>.......] - ETA: 4s - loss: 0.0899 - acc: 0.9844
768/832 [==========================>...] - ETA: 1s - loss: 0.0910 - acc: 0.9844
832/832 [==============================] - 18s 22ms/step - loss: 0.0908 - acc: 0.9844 - val_loss: 0.3131 - val_acc: 0.9200

Congratulations! You've saved a lot of time and effort that would otherwise go into extracting features from audio files. Moreover, even when fed the raw audio files directly, the model is doing pretty well.

With this, we learned how to feed raw audio files to a deep neural network. Now you can take this knowledge and apply it to the audio problem that you want to solve. You just need to collect audio data, normalize it, and feed it to your model.

The above code is available at the following GitHub repository

github.com

That's it for this post. My name is Vivek Amilkanthawar. See you soon with another one; until then, Happy Learning :)

References:

1) https://arxiv.org/pdf/1610.00087.pdf
2) https://github.com/philipperemy/very-deep-convnets-raw-waveforms
3) https://openreview.net/pdf?id=S1Ow_e-Rb

2019-04-01

Teachable Desktop Automation

// Teach your computer to recognize gestures and trigger a set of actions to perform after a certain gesture is recognized.//

Hello World! I'm very excited to share my recent experiment, in which I taught my computer certain gestures so that whenever one of them is recognized, a certain action is performed.

In this blog post, I'll explain everything you need to do to achieve the following

IF: I wave to my webcam
THEN: move the mouse pointer a little to its right

I have used the power of Node.js to achieve this. The idea is to create a native desktop app that has access to the operating system, so it can perform actions such as a mouse click or a key press, and to train our model and run inference locally in that same app.

To make it work, I thought of using tensorflow.js and robotjs in an Electron App created using Angular.

f:id:vivek081166:20190329175638p:plain

So, are you ready? Let's get started…

Generate the Angular App

Let's start by creating a new Angular project from scratch using the angular-cli

npm install -g @angular/cli

ng new teachable-desktop-automation

cd teachable-desktop-automation

Install Electron

Add Electron and its type definitions to the project as dev dependencies

npm install electron --save-dev

npm install @types/electron --save-dev

Configuring the Electron App

Create a new directory inside the project's root directory and name it "electron". We will use this folder for all Electron-related files.

Next, create a new file called "main.ts" inside the "electron" folder. This file will be the entry point of our Electron application.

Finally, create a new "tsconfig.json" file inside the same directory. We need this file to compile the TypeScript file into JavaScript.

Use the following as the content of the "tsconfig.json" file.

gist.github.com

Now it's time to fill the "main.ts" with some code to fire up our electron app.

gist.github.com

Visit electronjs for details

Make a custom build command

Create a custom build command for compiling main.ts and starting Electron. To do this, update "package.json" in your project as shown below

{
  "name": "teachable-desktop-automation",
  "version": "0.0.0",
  "main": "electron/dist/main.js", // <-- this was added
  "scripts": {
    "ng": "ng",
    "start": "ng serve",
    "build": "ng build",
    "test": "ng test",
    "lint": "ng lint",
    "e2e": "ng e2e",
    "electron": "ng build --base-href ./ && tsc --p electron && electron ."  // <-- this was added
  },
  // ...omitted
}

We can now start our app using npm:

npm run electron

f:id:vivek081166:20190329180447p:plain

There we go… our native desktop app is up and running! However, it is not doing anything yet.

Let's make it work and also add some intelligence to it…

Add Robotjs to the project

In order to simulate a mouse click or a keyboard button press, we will need robotjs in our project.
I installed robotjs with the following command

npm install robotjs

and then tried to use it in the project by following some examples from the official documentation. However, I struggled a lot to make robotjs work in the Electron app. Finally, here is the workaround that I came up with

Add ngx-electron to the project

npm install ngx-electron

Then inject its service into the component where you want to use the robot, and use remote.require() to load the robotjs package.

import { Component } from '@angular/core';
import { ElectronService } from 'ngx-electron';

@Component({
  selector: 'app-root',
  templateUrl: './app.component.html',
  styleUrls: ['./app.component.scss'],
})
export class AppComponent {
  // handle to robotjs, typed loosely because it is loaded at runtime
  robot: any;

  constructor(private electronService: ElectronService) {
    // load robotjs through Electron's remote module instead of a direct import
    this.robot = this.electronService.remote.require('robotjs');

    // move mouse pointer to right
    const mousePosition = this.robot.getMousePos();
    this.robot.moveMouse(mousePosition.x + 5, mousePosition.y);
  }
}

Add Tensorflow.js to the project

We'll be creating a KNN classifier that can be trained live in our electron app (native desktop app) with images from the webcam.

npm install @tensorflow/tfjs

npm install @tensorflow-models/knn-classifier

npm install @tensorflow-models/mobilenet

A quick reference for KNN Classifier and MobileNet package

Here is a quick reference to the methods that we'll be using in our app. You can always refer to tfjs-models for all the details on implementation.

KNN Classifier

knnClassifier.create() : Returns a KNNImageClassifier.
.addExample(example, classIndex) : Adds an example to the specific class training set.
.predictClass(image) : Runs the prediction on the image, and returns an object with a top class index and confidence score.

MobileNet

.load() : Loads and returns a model object.
.infer(image, endpoint) : Get an intermediate activation or logit as Tensorflow.js tensors. Takes an image and the optional endpoint to predict through.
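If the KNN part feels like a black box, the following Python sketch shows the idea behind addExample and predictClass on plain feature vectors. This is not the tfjs API, just my illustration: TinyKNN and its method names are hypothetical, and in the real app the feature vectors would be the activations returned by MobileNet's infer().

```python
import math
from collections import Counter


class TinyKNN:
    """Minimal k-nearest-neighbours classifier over plain feature vectors."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (feature_vector, class_index)

    def add_example(self, features, class_index):
        self.examples.append((list(features), class_index))

    def predict_class(self, features):
        """Return (class_index, confidence) from a majority vote of the k nearest examples."""
        if not self.examples:
            raise ValueError("add at least one example before predicting")
        ranked = sorted(self.examples, key=lambda ex: math.dist(ex[0], features))
        nearest = ranked[: self.k]
        votes = Counter(class_index for _, class_index in nearest)
        class_index, count = votes.most_common(1)[0]
        return class_index, count / len(nearest)
```

Nothing is really "trained" here: adding an example just stores it, which is why this kind of classifier can be taught live from the webcam.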

Finally make it work

For this blog post, I'll set the cosmetics (CSS, I mean) aside and concentrate only on the core functionality

Using some boilerplate code from Teachable Machine and injecting robotjs into the app component, here is how it looks

gist.github.com

and now, when you run the command npm run electron, you see me (kidding)

f:id:vivek081166:20190329181118p:plain

Let's Test it

I'll train the image classifier on me waving to the webcam (Class 2) and also with me doing nothing (Class 1).

The following events are associated with these two classes

Class 1: Do nothing
Class 2: Move mouse pointer slightly to the right

youtu.be

With this, your computer can learn your gestures and can perform a whole lot of different things because you have direct access to your operating system.

The source code of this project can be found at the URL below…

github.com

I have shared my simple experiment with you. Now it's your turn: try to build something with it, and consider sharing it with me as well :D

I have just scratched the surface. Any enhancements or improvements to the project are welcome through GitHub pull requests.

Well, that's it for this post… Thank you for reading until the end. My name is Vivek Amilkanthawar, see you soon with another one.

2019-03-30

I tried out Google's bert!

Hello, this is チナパ!

The other day I used Word2vec to build a dictionary that maps words to numbers. As a follow-up, let's use "bert" (Bidirectional Encoder Representations from Transformers), which Google released recently.

f:id:c-pattamada:20190329191856p:plain — Better than humans!

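What is bert?

First, let me explain what it actually is and why it is impressive.

Until now, the typical approach in natural language processing was to read a sentence in order, building up context from the words seen so far, so that the meaning of the next word is influenced by the context of the preceding words.

This was done with RNN architectures.

Last year, models such as ELMo and bert arrived, and bert in particular uses "Attention" to produce more accurate results.

For example, when building a chatbot, an Attention-based architecture is like composing a reply while remembering the keywords of the previous message.

Bert uses this even more intensively.

It is described as "multi-headed attention": a huge model that analyses the text while attending to several different places separately.

On SQuAD, one of the benchmarks for English natural language processing, it can apparently achieve better-than-human results with ease.

bert can also be used as-is with its pre-trained weights, so let's try it!

Great! So how do we use it?

Bert is fairly heavy, so this time I used Colab.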

!pip install bert-tensorflow

from google.colab import drive
drive.mount('/content/gdrive')

Grant access to Drive, then fetch bert's vocabulary data from tensorflow_hub.

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from bert.tokenization import FullTokenizer

bert_path = "https://tfhub.dev/google/bert_multi_cased_L-12_H-768_A-12/1"  # multilingual model, used here for Japanese

sess = tf.Session()

def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    bert_module =  hub.Module(bert_path)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    vocab_file, do_lower_case = sess.run(
        [
            tokenization_info["vocab_file"],
            tokenization_info["do_lower_case"],
        ]
    )

    return FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
  
tokenizer = create_tokenizer_from_hub_module()

This gives us an instance of bert's tokenizer. Like MeCab, it splits a string into words. In bert, kanji are all converted into one token per character.

Running tokenizer.tokenize('こんにちは、今日の天気はいかがでしょうか？') gives

['こ',
 '##ん',
 '##に',
 '##ち',
 '##は',
 '、',
 '今',
 '日',
 'の',
 '天',
 '気',
 'は',
 '##い',
 '##か',
 '##が',
 '##で',
 '##し',
 '##ょう',
 '##か',
 '？']

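which is a rather wild result.

"##" marks a piece that continues the previous token. You need to be careful about this when converting tokens back into a string... Also, if there is a strange character it does not know, it comes out like this.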

tokenizer.tokenize('゛')

# output => ['[UNK]']
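As a small aside on the "##" pieces mentioned above, here is a hedged Python sketch of how you might glue them back together when turning tokens into text again. merge_wordpieces is a hypothetical helper of mine, not part of the bert package.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces back onto the piece before them."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and extend the previous word
        else:
            words.append(token)
    return words


tokens = ['こ', '##ん', '##に', '##ち', '##は', '、', '今', '日', 'の', '天', '気']
print(merge_wordpieces(tokens))
```

Note that '[UNK]' tokens cannot be recovered this way, since the original character is already lost at that point.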

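Now let's vectorise.

To turn tokens into vectors with bert, three inputs are needed. Besides the ids produced by

tokenizer.convert_tokens_to_ids(...)

you also need segment ids and an input mask.

input_ids are what bert uses to look up vectors quickly. segment_ids are sentence numbers, which let you separate, say, a message from its reply. input_mask is used so that padding inputs can be ignored.

f:id:c-pattamada:20190329200016p:plain — Yes, this is the one!

For now, let's process just a single sentence. The three vectors above are built with a method like this.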

import numpy as np

def convert_string_to_bert_input(tokenizer, input_string, max_length=128):
    # [CLS] marks the start of the input
    tokens = ["[CLS]"]
    tokens.extend(tokenizer.tokenize(input_string))
    # truncate so that the closing [SEP] still fits within max_length
    if len(tokens) > max_length - 1:
        tokens = tokens[0 : (max_length - 1)]
    tokens.append("[SEP]")

    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # mask is 1 for real tokens so that the padding added below can be ignored
    input_mask = [1] * len(tokens)

    # pad all three lists with zeros up to max_length
    while len(input_ids) < max_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    return (
        np.array(input_ids),
        np.array(input_mask),
        np.array(segment_ids),
    )

You can see that the method above adds "[CLS]" and "[SEP]". These are the "start token" and the "sentence separator token" used by bert. When there are multiple sentences, "[SEP]" goes between them.
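For the multi-sentence case just mentioned, here is a minimal Python sketch of how the tokens and segment ids line up for a pair of sentences. build_pair_inputs is a hypothetical helper, and the tokens are assumed to be already tokenised.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Return (tokens, segment_ids) for the layout [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)       # first sentence, including [CLS] and its [SEP]
    second = list(tokens_b) + ["[SEP]"]
    tokens += second
    segment_ids += [1] * len(second)      # second sentence and the closing [SEP]
    return tokens, segment_ids


tokens, segment_ids = build_pair_inputs(["今", "日"], ["天", "気"])
print(tokens)
print(segment_ids)
```

Padding and the input mask would then be added in the same way as in the single-sentence method.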

Load the data you want to vectorise with Pandas and convert it with convert_string_to_bert_input.

data_path = ...
input_column = ...
df = pd.read_csv(data_path)

features = df[input_column].map(
                       lambda my_string:
                           convert_string_to_bert_input(tokenizer, my_string)
                       )

Using bert in keras

Right, now let's work the magic of using bert in a keras model.

https://github.com/strongio/keras-bert/blob/master/keras-bert.ipynb

This was a great reference.

from tensorflow.keras import backend as K  # K.cast is used in call() below

class BertLayer(tf.layers.Layer):
    def __init__(self, n_fine_tune_layers=10, **kwargs):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.trainable = False
        self.output_size = 768
        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.bert = hub.Module(
            bert_path,
            trainable=self.trainable,
            name="{}_module".format(self.name)
        )

        trainable_vars = self.bert.variables

        # Remove unused layers
        trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name]

        # Select how many layers to fine tune
        trainable_vars = trainable_vars[-self.n_fine_tune_layers :]

        # Add to trainable weights
        for var in trainable_vars:
            self._trainable_weights.append(var)
            
        for var in self.bert.variables:
            if var not in self._trainable_weights:
                self._non_trainable_weights.append(var)

        super(BertLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
            "pooled_output"
        ]
        return result

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_size)

This code creates a custom keras Layer and uses bert inside it. This page is also very handy when writing your own custom Layer: https://keras.io/layers/writing-your-own-keras-layers/

There are three important parts.

1. build()

def build(self, input_shape):
    self.bert = hub.Module(
            bert_path,
            trainable=self.trainable,
            name="{}_module".format(self.name)
        )

Here tensorflow hub is used to load the model defined by bert_path.

2. call()

input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
            "pooled_output"
        ]

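Here the inputs prepared earlier are turned into a dict and passed to the bert model to get the result. At the time of writing (29/03/2019), only signature="tokens" is supported. Besides "pooled_output" there is also "sequence_output". pooled_output is a single vector for the whole input, whereas sequence_output has one vector per token. We want one vector per input here, so we go with pooled_output.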

3. output size

self.output_size = 768

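This number depends on the model in use. In the base model each layer returns 768 dimensions; in the Large model it is 1024 dimensions.

With this, the three variables created earlier go into a dict and self.bert returns the result. The layer above can be used in an ordinary keras model.

Example: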

def get_model(max_length=128, num_classes=5):
    input_ids = tf.keras.layers.Input(shape=(max_length,), name="input_ids")
    in_mask = tf.keras.layers.Input(shape=(max_length,), name="input_masks")
    in_segment = tf.keras.layers.Input(shape=(max_length,), name="segment_ids")
    bert_inputs = [input_ids, in_mask, in_segment]

    bert_output = BertLayer(n_fine_tune_layers=1)(bert_inputs)
    dense = tf.keras.layers.Dense(128, activation='relu')(bert_output)
    # softmax pairs with categorical_crossentropy for single-label classification
    pred = tf.keras.layers.Dense(num_classes, activation='softmax')(dense)

    model = tf.keras.models.Model(inputs=bert_inputs, outputs=pred)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

Do give it a try!

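Summary

This time I gave a brief explanation of how natural language processing has been done so far and of bert. We then converted strings into the form bert expects so that they can be used in a model.

There are parts I do not fully understand myself yet, so I would love to hear your feedback.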

2019-03-26

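I built a dictionary with Word2Vec

Machine Learning Python Natural Language Processing

Hello! This is チナパ! Japanese vocabulary is hard, so I figured I needed a dictionary. I often use https://jisho.org, but it is rather hard to explain that to a computer...

OK, that was a joke.

In natural language processing you need a "dictionary" for Japanese (or any other language). You probably know what a dictionary is, but here it means one that converts words into numeric vectors of a fixed dimension.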

f:id:c-pattamada:20190325201950j:plain — A vector dictionary?!

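This can be done at the "character level", at the "word level", and recently at the "sentence level" as well.

For the character and word levels there are methods such as GloVe and Word2Vec. Today let's build a simple dictionary in Python with Word2Vec.

Note: in natural language processing, tuning the dictionary can bring sizeable improvements on a task, but this time we only aim for a dictionary that is good enough for a demo.

What you need

About 2-3 GB of disk space on your computer (this can also be done in the cloud or on Colab, but we will work locally this time)
The following libraries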
- MeCab, wget, gensim, jaconv

import MeCab
import wget
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import jaconv
from multiprocessing import cpu_count
import os
from typing import List

VECTORS_SIZE = 50

wiki_file_name = 'jawiki-latest-pages-articles1.xml-p1p106175.bz2'
ja_wiki_latest_url = 'https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles1.xml-p1p106175.bz2'

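# The wiki file above is only a partial dump.
# Other dumps can be found at the URL below
# https://dumps.wikimedia.org/jawiki/latest/

At the character level around 20 dimensions may be enough, but at the word level I recommend at least 50. Below that, in my experience, the meaning of words is often not captured well.

Getting the data

First of all we need a lot of data, so let's download wikipedia!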

if not os.path.isfile(wiki_file_name):
    wget.download(ja_wiki_latest_url, bar=wget.bar_adaptive)

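This time we download only part of it, which took about 15 minutes. Downloading all the articles takes around 3 GB and 3 hours.

In Japanese, the same word or character can appear in full-width or half-width form. For Latin letters, upper and lower case mostly carry the same meaning (ｗａｔｅｒ and water). Here we use the jaconv library to unify everything to full-width, and lower-case it.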

def normalize_text(text):
    return jaconv.h2z(text, digit=True, ascii=True, kana=True).lower()

If you would rather unify to half-width, you can use the jaconv.z2h method. When you use the dictionary later, the input must be normalised in the same way.
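If you do not want an extra dependency, Python's standard library can do a similar kind of unification. The sketch below uses Unicode NFKC normalisation, which is a different technique from jaconv: it folds full-width Latin letters and digits to half-width and half-width kana to full-width, so the result is not identical to the jaconv.h2z output. normalize_text_nfkc is a hypothetical name of mine.

```python
import unicodedata


def normalize_text_nfkc(text):
    """Fold width variants with NFKC, then lower-case Latin letters."""
    return unicodedata.normalize("NFKC", text).lower()


print(normalize_text_nfkc("ｗａｔｅｒ"))
print(normalize_text_nfkc("ＷＡＴＥＲ"))
```

Whichever method you pick, the important part is to use the same one when building the dictionary and when looking words up.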

Using normalize_text, let's convert the wikipedia data we downloaded earlier to full-width and save it to a plain txt file.

wiki_text_file_name = 'wiki.txt'

def read_wiki(wiki_data, save_file):
    if os.path.isfile(save_file):
        print('Skipping reading wiki file...')
        return
    with open(save_file, 'w') as out:
            wiki = WikiCorpus(wiki_data, lemmatize=False, dictionary={}, processes=cpu_count())
            wiki.metadata = True
            texts = wiki.get_texts()
            for i, article in enumerate(texts):
                text = article[0]  # article[1] holds the article metadata, such as its title
                sentences = [normalize_text(line) for line in text]
                text = ' '.join(sentences) + u'\n'
                out.write(text)
                if i % 1000 == 0 and i != 0:
                    print('Logged', i, 'articles')
    print('Finished saving wiki text')


read_wiki(wiki_file_name, wiki_text_file_name)

Here gensim's WikiCorpus makes the dump easy to parse. We now have data to train on, so let's move to the next step.

Creating word tokens

This is where MeCab comes in.

def get_words(text: str, mt: MeCab.Tagger) -> List[str]:
    mt.parse('')
    parsed = mt.parseToNode(text)
    components = []
    while parsed:
        if len(parsed.surface) >= 1:  # skips the BOS/EOS nodes, whose surface is empty
            components.append(parsed.surface)
        parsed = parsed.next
    return components

The get_words method splits a text into words (tokens). Each node returned by MeCab's parseToNode(text) has both surface and feature, but here we only use surface.

Now let's tokenise the wikipedia data we created earlier.

token_file = 'tokens.txt'

def tokenize_text(input_filename: str, output_filename: str, mt: MeCab.Tagger):
    lines = 0
    if os.path.isfile(output_filename):
        lines = count_lines(output_filename)  # when resuming, find out how many lines to skip
    batch = []
    with open(input_filename, 'r') as data:
        for i, text in enumerate(data):  # stream line by line instead of loading the whole file
            if i < lines:
                continue
            tokenized_text = ' '.join(get_words(text, mt))
            batch.append(tokenized_text)
            if i % 10000 == 0 and i != 0:
                write_tokens(batch, output_filename)
                batch = []
                print('Tokenized', i, 'lines')
    write_tokens(batch, output_filename)
    print('Finished creating tokens')


def write_tokens(batch: List[str], file_name: str):
    with open(file_name, 'a+') as out:
        for out_line in batch:
            out.write(out_line)
            out.write('\n')

            
def count_lines(file: str) -> int:
    count = 0
    with open(file) as d:
        for line in d:
            count += 1
    return count
    
tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')  # if you do not have neologd, another dictionary is fine
tokenize_text(wiki_text_file_name, token_file, tagger)

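Each Wikipedia article is converted into tokens separated by half-width spaces and written out to a file again. Keeping everything in memory might work too, but it often uses a lot of RAM, so I write to a file instead.

Vectorisation

At last it is time to create the vectors. Code-wise this is simple.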

vector_file = 'ja-MeCab-50.data.model'
def generate_vectors(input_filename, output_filename):
    if os.path.isfile(output_filename):
        return
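    # these are gensim 3.x parameter names; gensim 4 renamed size to vector_size and iter to epochs
    model = Word2Vec(LineSentence(input_filename),
                     size=VECTORS_SIZE, window=5, min_count=5,
                     workers=cpu_count(), iter=5)
    model.save(output_filename)
    print('Finished creating vectors.')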

generate_vectors(token_file, vector_file)

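Gensim's Word2Vec produces 50-dimensional vectors, as defined above. min_count is the minimum number of occurrences a word needs in order to be included in the dictionary. Adjust it to suit your data.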
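To get a feel for what min_count does before committing to a long training run, here is a tiny Python sketch of the same filtering idea. build_vocab is a hypothetical helper for illustration; gensim does this internally and more efficiently.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Return the set of words that occur at least `min_count` times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


sentences = [['東京', 'は', '首都'], ['東京', 'は', '広い']]
print(build_vocab(sentences, min_count=2))
```

Words below the threshold simply get no vector, so a lookup for them will fail. A higher min_count gives a smaller, cleaner dictionary; a lower one keeps rare words at the cost of noisier vectors.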

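Creating the vectors also takes quite a while, so if you run it locally I recommend letting it train overnight. It may also be better to run it on google datalab or aws.

Time to try it out!

The dictionary is built, so how do we use it?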

from gensim.models import Word2Vec
import pprint

model = Word2Vec.load(vector_file)
model = model.wv

pprint.pprint(model['東京'])
pprint.pprint(model.most_similar(positive=['東京'], topn=5))

Give it a try with the code above!
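Under the hood, most_similar ranks words by cosine similarity between their vectors. Here is a minimal pure-Python sketch of that idea; the two-dimensional toy vectors are made up for illustration and have nothing to do with the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=5):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, just to show the call.
toy = {'東京': [1.0, 0.0], '大阪': [0.9, 0.1], 'りんご': [0.0, 1.0]}
print(most_similar('東京', toy, topn=2))
```

With 50 real dimensions the principle is the same: words used in similar contexts end up pointing in similar directions.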

f:id:c-pattamada:20190326105354p:plain — My own results

Summary

This time I explained how to build a vector dictionary. We trained on wikipedia data, tokenised it with MeCab, and finally created a Word2Vec vector dictionary.

The resulting dictionary can also be used as an Embedding in a neural network, but that is for another time.

The photo at the top is from: https://www.pexels.com/photo/black-and-white-book-business-close-up-267669/

The code for this post is here: gist.github.com

This was also a useful reference: GitHub - philipperemy/japanese-words-to-vectors: Word2vec (word to vectors) approach for Japanese language using Gensim and Mecab.

2019-03-26

Audio Classification using AutoML Vision

For a given audio dataset, can we do audio classification using spectrograms? Well, let's try it out ourselves, and let's use Google AutoML Vision to fail fast :D

We'll convert our audio files into their respective spectrograms and use the spectrograms as images for our classification problem.

Here is the formal definition of a spectrogram

A Spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time.
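To make that definition a little more tangible, here is a deliberately naive Python sketch that turns a signal into a list of spectra, one per time frame. It is only an illustration of the concept (a slow O(n^2) DFT on non-overlapping frames, no windowing); real tools such as FFmpeg use fast transforms and overlapping windows. The function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each non-negative frequency bin of one frame (naive DFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_size):
    """Split the signal into non-overlapping frames and return one spectrum per frame."""
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, frame_size)]
    return [dft_magnitudes(frame) for frame in frames]


# A pure tone that completes 3 cycles per 16-sample frame.
tone = [math.sin(2 * math.pi * 3 * t / 16) for t in range(32)]
spec = spectrogram(tone, frame_size=16)
print(len(spec), "frames,", len(spec[0]), "frequency bins each")
```

Each row is a moment in time and each column a frequency; drawing the magnitudes as colours gives the kind of image we will feed to AutoML Vision.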

For this experiment, I'm going to use the following audio dataset from Kaggle

www.kaggle.com

Go ahead and download the dataset. (Caution: the dataset is over 5 GB, so be patient with any action you perform on it. For my experiment, I rented a Linux virtual machine on Google Cloud Platform (GCP), and I'll be performing all the steps from there. You also need a GCP account to follow this tutorial.)

Step 1: Download the Audio Dataset

Training Data (4.1 GB)

curl https://zenodo.org/record/2552860/files/FSDKaggle2018.audio_train.zip?download=1 --output audio_train.zip

unzip audio_train.zip

Test Data (524 MB)

curl https://zenodo.org/record/2552860/files/FSDKaggle2018.audio_test.zip?download=1 --output audio_test.zip

unzip audio_test.zip

Metadata (150 KB)

curl https://zenodo.org/record/2552860/files/FSDKaggle2018.meta.zip?download=1 --output meta_data.zip

unzip meta_data.zip

After downloading and unzipping, you should have the following in your folder
(Note: I renamed the folders after unzipping)

f:id:vivek081166:20190325191702p:plain

Step 2: Generate Spectrograms

Now that we have our audio data in place, let's create spectrograms for each audio file.

We'll need FFmpeg to create spectrograms of audio files

ffmpeg.org

Install FFmpeg using the following command

sudo apt-get install ffmpeg

Try it out yourself… go into the folder which has an audio file and run the following command to create its spectrogram

ffmpeg -i audioFileName.wav -lavfi showspectrumpic=s=1024x512 anyName.jpg

For example, "00044347.wav" from training dataset will sound like this

clyp.it

and spectrogram of "00044347.wav" looks like this

f:id:vivek081166:20190326105505j:plain

As you can see, the red areas show the loudness of the different frequencies present in the audio file, plotted over time. In the above example, you heard a hi-hat. The first part of the file is loud and then the sound fades away, and the same can be seen in its spectrogram.

The above ffmpeg command creates a spectrogram with a legend; however, we do not need the legend for image processing, so let's drop it and create plain spectrograms for all our image data.

Use the following shell script to convert all your audio files into their respective spectrograms
(Create and run the following shell script at the directory level where "audio_data" folder is present)

gist.github.com

I have moved all the generated image files into the folder "spectro_data"

f:id:vivek081166:20190325192534p:plain

Step 3: Move image files to Storage

Now that we have generated spectrograms for our training audio data, let's move all these image files to Google Cloud Storage (GCS); from there we will use them in the AutoML Vision UI.

Use the following command to copy image files to GCS

gsutil cp spectro_data/* gs://your-bucket-name/spectro-data/

f:id:vivek081166:20190325192818p:plain

Step 4: Prepare file paths and their label

I created the following CSV file from the metadata that we downloaded earlier. I removed all the other columns and kept only the image file location and its label, because that is what AutoML needs.

f:id:vivek081166:20190325193037p:plain

docs.google.com

You will have to put this CSV file on your Cloud Storage where the other data is stored.
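If you would rather generate that CSV with a script than edit it by hand, something along these lines should work. This is a sketch under assumptions: build_automl_csv is a hypothetical helper, the rows are assumed to be (wav file name, label) pairs taken from the metadata, and the bucket path is the placeholder used earlier in this post.

```python
import csv
import io


def build_automl_csv(rows, bucket_prefix):
    """rows: iterable of (wav_file_name, label). Returns CSV text with image_path,label lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for wav_name, label in rows:
        image_name = wav_name.rsplit(".", 1)[0] + ".jpg"  # spectrogram image for this audio file
        writer.writerow([f"{bucket_prefix}/{image_name}", label])
    return out.getvalue()


# The label string here is only an example value.
print(build_automl_csv([("00044347.wav", "Hi-hat")], "gs://your-bucket-name/spectro-data"))
```

Write the returned text to a file and copy it to the bucket with gsutil, just like the images.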

Step 5: Create a new Dataset and Import Images

Go to AutoML Vision UI and create a new dataset

cloud.google.com

f:id:vivek081166:20190325193335p:plain

Enter a dataset name of your choice and, for importing images, choose the second option, "Select a CSV file on Cloud Storage", and provide the path to the CSV file on your Cloud Storage.

f:id:vivek081166:20190325193453p:plain

The process of importing images may take a while, so sit back and relax. You'll get an email from AutoML once the import is completed.

After the image data has been imported, you'll see something like this

f:id:vivek081166:20190325193633p:plain

Step 6: Start Training

This step is super simple… just verify your labels and start training. All the uploaded images will be automatically divided into training, validation, and test sets.

f:id:vivek081166:20190325193844p:plain

Give a name to your new model and select a training budget.
For our experiment, let's select 1 node hour (free*) as the training budget, start training the model, and see how it performs.

f:id:vivek081166:20190325193935p:plain

Now wait again for the training to complete. You'll receive an email once it is done, so you can leave the screen, let the model train in the meantime, and come back later.

f:id:vivek081166:20190325194009p:plain

Step 7: Evaluate

And here are the results…

f:id:vivek081166:20190325194207p:plain

Hurray… with very little effort, our model did pretty well.

f:id:vivek081166:20190325194242p:plain

Congratulations! With only a few hours of work and the help of AutoML Vision, we can now be fairly confident that audio files can be classified from their spectrograms using a machine learning vision approach. With this conclusion, we can now build our own vision model using a CNN, tune its parameters, and produce more accurate results.

Or, if you don't want to build your own model, go ahead and train the same model with more node-hours, then follow the instructions in the PREDICT tab to use your model in production.

That's it for this post. I'm Vivek Amilkanthawar from Goalist. See you soon with another post like this; until then, Happy Learning :)

goalist.co.jp

2019-02-28

Choosing a Deep Learning Framework

Implementing deep learning algorithms from scratch using Python and NumPy is a good way to understand the basic concepts and to see what these algorithms are really doing by opening up the deep learning black box.

However, as you start to implement very large or more complex models, such as convolutional neural networks (CNN) or recurrent neural networks (RNN), it becomes increasingly impractical, at least for most people like me, to implement everything yourself from scratch.

Even if you understand how to do matrix multiplication and are able to implement it in your own code, as you build very large applications you will probably not want to write your own matrix multiplication function. Instead, you will want to call a numerical linear algebra library that can do it more efficiently for you, won't you?

The efficiency of your algorithm will help you fail fast 😃 and thus help you iterate through the IDEA -> EXPERIMENT -> CODE cycle much more quickly.
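To make the matrix multiplication point concrete, here is a small sketch that compares a hand-written triple loop with the equivalent call to NumPy. The function name `naive_matmul` and the example matrices are made up for illustration; NumPy's `@` operator is the library call.

```python
import numpy as np


def naive_matmul(a, b):
    """Multiply two matrices given as lists of lists, one element at a time."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

hand_written = naive_matmul(a, b)
library = np.array(a) @ np.array(b)  # same product, computed by NumPy

print(hand_written)
print(np.allclose(hand_written, library))
```

Both give the same product, but the library version is a single expression and runs in optimized compiled code, which matters more and more as the matrices grow.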

f:id:vivek081166:20190228134309p:plain

I think this is crucially important when you are in the middle of a deep learning pipeline.

f:id:vivek081166:20190228134333p:plain

So let's take a look at the frameworks out there…

Today, there are many deep learning frameworks that make it easy for you to implement neural networks, and here are some of the leading ones.

f:id:vivek081166:20190228134407p:plain

Each of these frameworks has a dedicated user and developer community, and I think each of them is a credible choice for some subset of applications. However, when I look at the graph below, TensorFlow is my obvious choice.

f:id:vivek081166:20190228134323p:plain

Well, I just said that I would choose TensorFlow. But are popularity scores like the ones above the only thing that matters when choosing a framework for your deep learning project? Turns out not…

I think many of these frameworks are evolving and getting better very rapidly. A framework that tops the popularity ranking in 2018 may not hold the same position by the end of 2019.

Many people write articles comparing these deep learning frameworks and tracking how they change. Because the frameworks keep evolving and improving month to month, I'll leave you to do a few internet searches yourself if you want to see the arguments on the pros and cons of each.

So, how can you make a decision about which framework to use?

Rather than strongly endorsing any of these frameworks, I would like to share three factors that Stanford Professor Andrew Ng considers important enough to influence your decision.

1) Ease of programming

This includes developing, iterating, and finally, deploying your neural network to production where it may be used by millions of users.

2) Running Speeds

Training on large data sets can take a lot of time, and differences in training speed between frameworks can make your workflow a lot more time efficient.

3) Openness

This last criterion is not often discussed, but Andrew Ng believes it is also very important. A truly open framework must be open source, of course, but must also be governed well.
So it is important to use a framework from a company that you can trust. As more people start to use the software, the company should not gradually close off what was open source, or move the functionality into its own proprietary cloud services.

But at least in the short term, depending on your language preference (Python, Java, C++, or something else) and on the application you're working on (computer vision, natural language processing, online advertising, or something else), I think several of these frameworks could be a good choice.

f:id:vivek081166:20190228135807p:plain

So that was just a high-level overview of deep learning programming frameworks. Any of these frameworks can make you more efficient as you develop machine learning applications.

In a subsequent post, we'll take a step from zero → one to learn TensorFlow.

That's it for this post. My name is Vivek from Goalist. See you soon with another post like this; until then, Happy Learning :)

goalist.co.jp