Paper Note: Text Understanding from Scratch

Zhang, Xiang, and Yann LeCun. “Text understanding from scratch.” arXiv preprint arXiv:1502.01710 (2015).

Convolutional neural networks (CNNs) have been widely used on image problems; this paper replicates that success on text understanding. The authors propose converting a sentence into a character-level, image-like representation, feeding it into a carefully designed CNN, and training the model to classify the sentence.

Binary Encoding

The alphabet consists of 70 characters, the following 69 plus the newline character: abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'"/\|_@#$%^&*~`+-=<>()[]{}

Every character in a sentence is encoded as a one-of-m (one-hot) code, where m is the alphabet size, 70. For a sentence of length l, the input is therefore an m x l matrix in which each column represents one character. The result looks somewhat like Braille, the tactile writing system used by blind readers.
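The encoding can be sketched in a few lines of plain Python. This is only an illustration under some assumptions of mine: the names ALPHABET, CHAR_INDEX and encode are made up here, text is lowercased first, characters outside the alphabet (including spaces) become all-zero columns, and the sentence is truncated or zero-padded to a fixed length l.

```python
# Assumed alphabet: 69 printable characters plus newline, 70 entries in total.
# Note that '-' is listed twice, so one of the 70 rows is never set.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n"

# Map each character to the row of its first occurrence in the alphabet.
CHAR_INDEX = {}
for i, ch in enumerate(ALPHABET):
    CHAR_INDEX.setdefault(ch, i)


def encode(text, length):
    """Return an m x length matrix (list of rows) with one column per character."""
    m = len(ALPHABET)
    matrix = [[0] * length for _ in range(m)]
    # Truncate to `length`; shorter texts leave the trailing columns all zero.
    for col, ch in enumerate(text.lower()[:length]):
        row = CHAR_INDEX.get(ch)
        if row is not None:  # unknown characters stay as all-zero columns
            matrix[row][col] = 1
    return matrix
```

For example, encode("ab!", 5) gives a 70 x 5 matrix whose first three columns each contain a single 1 and whose last two columns are all zeros.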


CNN Model


The input is the m x l encoding described above. Two models (Large and Small) are proposed, both with 9 layers (6 convolutional + 3 fully connected).
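To get a feel for how the 6 convolutional layers shrink the sequence before the fully connected layers, here is a small shape calculation. The hyperparameters are assumptions on my part, recalled from the paper rather than stated in this note: kernel sizes 7, 7, 3, 3, 3, 3, non-overlapping max pooling of size 3 after layers 1, 2 and 6, an input length of 1014, and 256 (Small) or 1024 (Large) feature maps. Please check them against the paper before relying on them.

```python
# (kernel_size, pool_size) for each convolutional layer; None means no pooling.
# Assumed values for illustration only.
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]


def output_length(input_length, layers=CONV_LAYERS):
    """Sequence length left after all conv/pool layers."""
    length = input_length
    for kernel, pool in layers:
        length = length - kernel + 1   # stride-1 convolution without padding
        if pool is not None:
            length = length // pool    # non-overlapping max pooling
    return length


def flattened_size(input_length, n_features):
    """Number of inputs the first fully connected layer would receive."""
    return output_length(input_length) * n_features
```

Under these assumptions, an input of 1014 characters is reduced to 34 frames, so the first fully connected layer would see 34 x 256 inputs in the Small model and 34 x 1024 in the Large one.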

 

Data Augmentation using Thesaurus

They augment the data by replacing words with synonyms from a thesaurus. This is very different from image data augmentation, which relies on transformations such as flipping, cropping, and random rotation.
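A minimal sketch of synonym replacement might look like the following. It is a simplification, not the authors' procedure: THESAURUS is a made-up toy dictionary, augment and replace_prob are hypothetical names, and each replaceable word is swapped independently with a fixed probability.

```python
import random

# Toy thesaurus (hypothetical); a real one would come from a lexical resource.
THESAURUS = {
    "good": ["fine", "great"],
    "movie": ["film"],
    "quick": ["fast", "rapid"],
}


def augment(sentence, replace_prob=0.5, rng=None):
    """Return a copy of `sentence` with some words replaced by synonyms."""
    rng = rng or random.Random()
    words = []
    for word in sentence.split():
        synonyms = THESAURUS.get(word.lower())
        # Replace only words that have synonyms, each with probability replace_prob.
        if synonyms and rng.random() < replace_prob:
            words.append(rng.choice(synonyms))
        else:
            words.append(word)
    return " ".join(words)
```

Calling augment("a good movie", replace_prob=1.0) always rewrites "good" and "movie" while leaving "a" untouched; passing a seeded random.Random as rng makes the output reproducible.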

Chinese

Moreover, the model is not limited to English. The authors also test it on Chinese news data and report good results.

Evaluation

Datasets:

  1. DBpedia Ontology Classification
  2. Amazon Review Sentiment Analysis
  3. Yahoo! Answers Topic Classification
  4. AG’s news corpus
  5. Chinese news
