# How a Machine Learns to Read

How to use a machine to read human language

#### Requirements beyond anaconda3 (Python libraries; install them with pip or pip3)

* tensorflow
* keras
* pinyin
* jieba
* bcolz

### Download the data

Download the [ctrip hotel comment dataset](http://45.76.223.58/hotel.csv) by clicking the link.

### Settings

In [44]:
TASK = "hotel"
TAG_FIELD = "status"
SEQ_LEN = 100 # sequence length: shorter sequences are padded, longer ones are truncated
LABEL_LEN = 2
CHAR_SIZE = 3000
FACTOR_NB = 30
TK_VOCAB_SIZE = 3000
CN_VOCAB_SIZE = 2000
PY_VOCAB_SIZE = 50

## Take a look at the data

In [69]:
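import pandas as pd

# read straight from the URL, or from a local copy if it is already downloaded
# nlp_df = pd.read_csv("http://45.76.223.58/hotel.csv")
nlp_df = pd.read_csv("/data/hotel.csv")
nlp_df.sample(10) # sample 10 lines and their labels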

Unnamed: 0,hotel,status
774,我是看了点评才入住海景嘉福的，感觉还不如怡东。服务并不好。房间内通风不好有异味。也没看到有送...,GOOD
3577,"该洒店设施不全且不洁净,让客户感觉不太敢放心使用,且服务人员态度不是太好,早上入住客人收拾房...",EVIL
1178,"酒店整体感觉还行,但是地毯不干净,需要清洗.",GOOD
2419,真的很差，本来是为了考试离考点近一点才住的，基本上没有服务，房间阴暗一股怪味，本来定的是大床...,EVIL
4255,"房价非常贵,几乎没有房间可以看到对岸的象鼻山.送的水果饼干又少又不好吃.装修很低档的,离市区...",EVIL
2765,这是我在携程遇到过的最差的酒店，问题多多1.这大冷天的。房间塑钢窗密封性极差，四处漏风，第二...,EVIL
1565,"酒店房间太小了性价比不是很好,不过越靠北越是这样,没有办法啊,还是南方的酒店性价比好!",GOOD
2272,我非常不满意新月阁客栈，如果有零分可以选，我宁可一分也不肯给.他们的服务太差了.明明含早餐的...,EVIL
3430,不愧是westin，房间就不用说了，游泳池啊room service都很不错！卫生间很大！,GOOD
3438,这是第一次订这家酒店，酒店位置还可以，但肯定开业已经有一段年月，装修有点旧，办了入住手续，进...,EVIL


In [99]:
lines = list(nlp_df[TASK])

len_lines=len(lines)

from jieba import cut
from multiprocessing import Pool

# tokenize in parallel with a process pool (see the Python multiprocessing docs)

def cutline(line):
    return "|".join(list(cut(str(line))))

def cutline_l(line):
    return list(cut(str(line)))

In [None]:
p=Pool(6)

In [73]:
%time shards=p.map(cutline,lines)

CPU times: user 50.2 ms, sys: 27.9 ms, total: 78.1 ms
Wall time: 8.11 s


In [101]:
%time shards_list=p.map(cutline_l,lines)

CPU times: user 223 ms, sys: 65.9 ms, total: 289 ms
Wall time: 6.97 s


In [75]:
%time token_list=("|".join(shards)).split("|")

CPU times: user 35.2 ms, sys: 19.5 ms, total: 54.7 ms
Wall time: 54.8 ms


In [159]:
from collections import Counter

# count token frequencies

token_counter=Counter(token_list)
token_w=list(token_counter.keys())
token_c=list(token_counter.values())

In [160]:
token_df=pd.DataFrame({"token":token_w,"count":token_c}).sort_values(by="count",ascending=False)
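
The frequency table above can also be read straight off the `Counter`. A minimal sketch, assuming two made-up pre-segmented reviews joined with `|` the way `cutline` does (the sample strings are illustrative, not taken from the dataset):

```python
from collections import Counter

# Hypothetical segmented reviews, in the "|"-joined form produced by cutline()
shards = ["酒店|不错", "酒店|房间|小"]

# Flatten all reviews into one token list, then count
token_list = "|".join(shards).split("|")
token_counter = Counter(token_list)

# most_common(n) returns (token, count) pairs sorted by descending count
top = token_counter.most_common(2)
print(top)
```

`most_common` gives the same ordering as sorting the DataFrame by `count`, without building the intermediate key and value lists.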

#### Most frequently used tokens

In [81]:
token_df.head(20)

Unnamed: 0,count,token
10,30629,，
7,20237,的
44,12430,。
52,7685,了
23,7356,","
9,6262,酒店
3,5780,是
235,4792,我
304,4497,
14,4451,房间


#### Pronunciation Representation

In [None]:
# !!pip install pinyin

In [162]:
# Pinyin conversion function
import pinyin
def readcn(x):
    return pinyin.get(str(x),format="strip", delimiter=" ").lower()

In [181]:
len(nlp_df)

4676

In [167]:
nlp_df[TASK+"_py"]=nlp_df[TASK].apply(readcn)

nlp_df.sample(10)

Unnamed: 0,hotel,status,hotel_py,tk_list
1114,"房间小,风格老.特别是服务糟糕.我是3月初住的,当时房间空调吹冷风,跟服务员反映三次都没任何...",EVIL,"fang jian xiao , feng ge lao . te bie shi fu w...","[房间, 小, ,, 风格, 老, ., 特别, 是, 服务, 糟糕, ., 我, 是, 3..."
2300,对该酒店不是很满意.预定入住2天.结果入住当日查询无客人姓名.通过协成沟通没问题了以后..再...,EVIL,dui gai jiu dian bu shi hen man yi . yu ding r...,"[对, 该, 酒店, 不是, 很, 满意, ., 预定, 入住, 2, 天, ., 结果, ..."
2635,5月初入住这家酒店，只可以用2个字形容“糟糕”，是我见过的最差的5星级，晚上1楼的disco...,EVIL,5 yue chu ru zhu zhe jia jiu dian ， zhi ke yi ...,"[5, 月初, 入住, 这家, 酒店, ，, 只, 可以, 用, 2, 个, 字, 形容, ..."
3576,酒店外表及大堂看上去挺好的，本来对它也挺有信心的，而且就在市中心，出行挺方便的，可是从进电梯...,EVIL,jiu dian wai biao ji da tang kan shang qu ting...,"[酒店, 外表, 及, 大堂, 看上去, 挺, 好, 的, ，, 本来, 对, 它, 也, ..."
2566,有那么多朋友点评钻石公寓，不妨再讲两句。钻石新推出的“特价”双人房，其实一点也不特价。以前有...,EVIL,you na yao duo peng you dian ping zuan shi gon...,"[有, 那么, 多, 朋友, 点评, 钻石, 公寓, ，, 不妨, 再, 讲, 两句, 。,..."
2799,几乎是在凌晨才到的包头，包头也没有什么特别好的酒店，每次来了就是住在这家，所以也没有忒多的对...,GOOD,ji hu shi zai ling chen cai dao de bao tou ， b...,"[几乎, 是, 在, 凌晨, 才, 到, 的, 包头, ，, 包头, 也, 没有, 什么, ..."
3140,离公司较近，所以选择了这家3星酒店餐饮价格很有竞争力，和普通饭店接近。免费注册 网站导航 宾...,GOOD,li gong si jiao jin ， suo yi xuan ze le zhe ji...,"[离, 公司, 较近, ，, 所以, 选择, 了, 这家, 3, 星, 酒店, 餐饮, 价格..."
4042,长城酒店的外形很壮观。是嘉峪关里两家四星酒店之一。不过太陈旧了。优点1.离雄关广场走路10分...,GOOD,chang cheng jiu dian de wai xing hen zhuang gu...,"[长城, 酒店, 的, 外形, 很, 壮观, 。, 是, 嘉峪关, 里, 两家, 四星, 酒..."
1186,男友是老外第一次来上海所以选择住的离外滩近的酒店，25号入住一切OK，宾馆周围环境不怎么样就...,EVIL,nan you shi lao wai di yi ci lai shang hai suo...,"[男友, 是, 老外, 第一次, 来, 上海, 所以, 选择, 住, 的, 离, 外滩, 近..."
784,"感觉挺好的,是家不错的酒店,下个月去办事打算还住那儿.",GOOD,"gan jue ting hao de , shi jia bu cuo de jiu di...","[感觉, 挺, 好, 的, ,, 是, 家, 不错, 的, 酒店, ,, 下个月, 去, 办..."


In [169]:
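import numpy as np
from keras.preprocessing import sequence

def text2number(col_series,c2i_dict,default_idx):
    # map each jieba token to its index; unknown tokens get default_idx
    final=[]
    for line in list(col_series):
        final.append([c2i_dict.get(lt,default_idx) for lt in cut(str(line))])
    # rows have different lengths, so return a plain list for pad_sequences
    return final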

def s2ttl_list(col_series):
    return "".join(list(col_series.apply(lambda x:str(x))))
    
def make_dict(totalist,seq_len,char_size=CHAR_SIZE,namestr="nlp_analysis"):
    tdf=pd.DataFrame(list(totalist),columns=["char"])
    # vocabulary frequency count
    tdf_c=list(pd.crosstab(tdf["char"],columns="char").sort_values(by="char",ascending=False)[:char_size].index.values)

    # Make character to index & index to character
    char2idx = dict((v,k) for k,v in enumerate(tdf_c))
    idx2char = dict((k,v) for k,v in enumerate(tdf_c))
    
    np.save(u"/data/dict/%s_char2idx.npy"%(namestr),char2idx)
    np.save(u"/data/dict/%s_idx2char.npy"%(namestr),idx2char)

    return char2idx,idx2char

def word2array(col_series,char2idx):
    idx_array = sequence.pad_sequences(text2number(col_series=col_series, c2i_dict=char2idx, default_idx=CHAR_SIZE-1),
                                       maxlen=SEQ_LEN,
                                       value=CHAR_SIZE-1
                                      )
    return idx_array
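
The dictionary step boils down to ranking tokens by frequency, numbering the most frequent ones, and sending everything else to a fallback index. A dependency-free sketch of that idea follows; `build_vocab` and `encode` are hypothetical names, and this variant keeps only `vocab_size - 1` tokens so that the last index stays free for unknowns and padding (in `make_dict` above, the last index is also assigned to a real token).

```python
from collections import Counter

def build_vocab(tokens, vocab_size):
    # Keep the (vocab_size - 1) most frequent tokens; the last index is reserved
    keep = [tok for tok, _ in Counter(tokens).most_common(vocab_size - 1)]
    tok2idx = {tok: i for i, tok in enumerate(keep)}
    idx2tok = {i: tok for tok, i in tok2idx.items()}
    return tok2idx, idx2tok

def encode(tokens, tok2idx, vocab_size):
    # Tokens outside the vocabulary map to the reserved last index
    unk = vocab_size - 1
    return [tok2idx.get(t, unk) for t in tokens]

tok2idx, idx2tok = build_vocab(["a", "b", "a", "c", "a", "b"], vocab_size=3)
ids = encode(["a", "c", "b", "z"], tok2idx, vocab_size=3)
print(tok2idx, ids)
```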

### Making and saving dictionary

In [94]:
import numpy as np

In [95]:
char2idx_tk,idx2char_tk = make_dict(totalist=token_list,seq_len=SEQ_LEN,
                                    char_size=TK_VOCAB_SIZE,namestr="rcc_%s_tk"%(TASK))
char2idx_cn,idx2char_cn = make_dict(totalist=s2ttl_list(nlp_df[TASK]),seq_len=30,
                                    char_size=CN_VOCAB_SIZE,namestr="rcc_%s_cn"%(TASK))
char2idx_py,idx2char_py = make_dict(totalist=s2ttl_list(nlp_df[TASK+"_py"]),seq_len=50,
                                    char_size=PY_VOCAB_SIZE,namestr="rcc_%s_py"%(TASK))

Total chars:	3000
------------------------------------------------------------
The first 10 chars :	，的。了,酒店是我 房间
The last 10 chars :	旧旧关不上酒店客房视野邯郸不吃#油烟味茶叶差得
------------------------------------------------------------
The sequence length:	100
saving_dict
Dictionary Saved To /data/dict/rcc_hotel_tk_char2idx.npy
Dictionary Saved To /data/dict/rcc_hotel_tk_idx2char.npy
Total chars:	2000
------------------------------------------------------------
The first 10 chars :	，的。不是了一房,店
The last 10 chars :	碍碎樱爷仁阅钓辈抛径
------------------------------------------------------------
The sequence length:	30
saving_dict
Dictionary Saved To /data/dict/rcc_hotel_cn_char2idx.npy
Dictionary Saved To /data/dict/rcc_hotel_cn_idx2char.npy
Total chars:	50
------------------------------------------------------------
The first 10 chars :	 inauehgod
The last 10 chars :	!v）76（；？9“
------------------------------------------------------------
The sequence length:	50
saving_dict
Dictionary Saved To /data/dict/rcc_hotel_

In [102]:
nlp_df["tk_list"]=shards_list

nlp_df

Unnamed: 0,hotel,status,hotel_py,tk_list
0,可以这样评价是我们住过的最差的酒店，简陋的大堂，肮脏的房间，可怕的卫生间，绝对是一场恶梦！！,EVIL,ke yi zhe yang ping jia shi wo men zhu guo de ...,"[可以, 这样, 评价, 是, 我们, 住, 过, 的, 最差, 的, 酒店, ，, 简陋,..."
1,"很安静,隔音设施不错.服务员态度很好,下次还会选这里",GOOD,"hen an jing , ge yin she shi bu cuo . fu wu yu...","[很, 安静, ,, 隔音, 设施, 不错, ., 服务员, 态度, 很, 好, ,, 下次..."
2,"很典型的型酒店就是房g很小,l生g也不大。并且设施非常的简陋。空{@得f了些",EVIL,hen dian xing de xing jiu dian jiu shi fang g ...,"[很, 典型, 的, 型, 酒店, 就是, 房, g, 很小, ,, l, 生, g, 也,..."
3,"这样的酒店一般,前台服务员善意提醒预定房间有些吵(靠近高架),因为只有这样的房间才有收费的宽...",EVIL,"zhe yang de jiu dian yi ban , qian tai fu wu y...","[这样, 的, 酒店, 一般, ,, 前台, 服务员, 善意, 提醒, 预定, 房间, 有些..."
4,因为悦华酒店没房间了才通过携程定泉州酒店的。其实泉酒的整体比悦华大气，装修也很讲究，唯一比不...,GOOD,yin wei yue hua jiu dian mei fang jian le cai ...,"[因为, 悦华, 酒店, 没, 房间, 了, 才, 通过, 携程, 定, 泉州, 酒店, 的..."
5,简单来讲，房间还好吧，大过万达假日。不过当时冰雪节比较贵，觉得没那么划算。平时哈尔滨这个烂城...,GOOD,jian dan lai jiang ， fang jian huan hao ba ， d...,"[简单, 来讲, ，, 房间, 还好, 吧, ，, 大过, 万达, 假日, 。, 不过, 当..."
6,"房间设施可以,但空间偏小,客房电梯很特别(没房卡上不去),很适合某些人使用(偷吃)........",GOOD,"fang jian she shi ke yi , dan kong jian pian x...","[房间, 设施, 可以, ,, 但, 空间, 偏小, ,, 客房, 电梯, 很, 特别, (..."
7,酒店真的不怎么样，不知道怎么评上四星的。房间又小又旧，早餐也没什么吃的，服务也不好，而且会有...,EVIL,jiu dian zhen de bu zen yao yang ， bu zhi dao ...,"[酒店, 真的, 不怎么样, ，, 不, 知道, 怎么, 评, 上, 四星, 的, 。, 房..."
8,房间不是很大，但是卫生间装的很有创意。早餐很好，品种多，而且味道不错，服务人员的态度也很好。,GOOD,fang jian bu shi hen da ， dan shi wei sheng ji...,"[房间, 不是, 很大, ，, 但是, 卫生间, 装, 的, 很, 有, 创意, 。, 早餐..."
9,酒店别具一格，很有一些园林风尚。服务非常到位，无论是门童还是前台乃至清洁女工，处处体现热情、...,GOOD,jiu dian bie ju yi ge ， hen you yi xie yuan li...,"[酒店, 别具一格, ，, 很, 有, 一些, 园林, 风尚, 。, 服务, 非常, 到位,..."


In [104]:
# train / validation separation point
tv_point = int(len(nlp_df)*.8)

train_df = nlp_df[:tv_point]
valid_df = nlp_df[tv_point:]

len(train_df),len(valid_df)

(3740, 936)
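
The split point is `int(4676 * .8)`, which truncates 3740.8 down to 3740 and leaves 936 rows for validation. The slices keep the file order, so nothing is shuffled. A small sketch of the same arithmetic on a plain list (the `split_at_ratio` name is illustrative):

```python
def split_at_ratio(rows, ratio):
    # int() truncates, so the training part never exceeds ratio * len(rows)
    point = int(len(rows) * ratio)
    return rows[:point], rows[point:]

rows = list(range(4676))  # stand-in for the 4676 reviews
train_rows, valid_rows = split_at_ratio(rows, 0.8)
print(len(train_rows), len(valid_rows))
```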

In [105]:
trn_cn,trn_py,trn_tk,trn_y = train_df[TASK],train_df[TASK+"_py"],train_df["tk_list"],train_df[TAG_FIELD]
val_cn,val_py,val_tk,val_y = valid_df[TASK],valid_df[TASK+"_py"],valid_df["tk_list"],valid_df[TAG_FIELD]

In [None]:
!pip install bcolz

In [108]:
from tqdm import trange
import os
import bcolz

class nlp_prepro:
    def __init__(self,seq_len,vocab,char2idx,idx2char):
        self.seq_len = seq_len
        self.vocab = vocab
        self.char2idx = char2idx
        self.idx2char = idx2char
        self.pad_blank = np.ones(self.seq_len)*(vocab-1)
        
    def dict_map(self,x):
        # characters missing from the dictionary map to the last index
        try:
            return self.char2idx[x]
        except KeyError:
            return self.vocab-1
        
    def charlist2idx(self,charlist):
        return np.vectorize(self.dict_map)(charlist)
    
    def pad_idx(self,x):
        x = self.charlist2idx(x)
        padded = self.pad_blank.copy()
        padded[-min(len(x),self.seq_len):]=x[-self.seq_len:]
        return padded
    
    def split_cn(self,x):
        return list(str(x))
    
    def split_py(self,x):
        return list(str(x).split(" "))
    
    def split_tk(self,x):
        return x
    
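    def to_bcolz(self,bc_dir,series,split_func):
        # clear any existing array directory (the length check guards against wiping a very short path)
        if len(bc_dir)>5:
            os.system("rm -rf %s"%(bc_dir))
        bc = bcolz.carray(np.zeros((0,self.seq_len)),rootdir=bc_dir,dtype="int")
        t=trange(len(series))
        for i in t:
            # series[i] is a label lookup, so the series needs a 0..n-1 index
            bc.append(self.pad_idx(split_func(series[i])))
        # write the buffered rows to disk once, after all appends
        bc.flush()
        return bc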
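
`pad_idx` pads on the left with `vocab-1` and, for inputs longer than `seq_len`, keeps only the last `seq_len` items. The same rule on plain lists, as a minimal sketch (the `pad_left` helper is hypothetical and not part of the class):

```python
def pad_left(ids, seq_len, pad_id):
    # Keep at most the last seq_len items, then fill the front with pad_id
    ids = list(ids)[-seq_len:]
    return [pad_id] * (seq_len - len(ids)) + ids

short = pad_left([1, 2, 3], seq_len=5, pad_id=9)
long_ = pad_left([1, 2, 3, 4, 5, 6], seq_len=4, pad_id=9)
print(short, long_)
```

Left padding places the real content at the end of the sequence, which is what an LSTM reads last before it emits its final state.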

In [115]:
cn_pre = nlp_prepro(seq_len=150,vocab=CN_VOCAB_SIZE,char2idx = char2idx_cn,idx2char=idx2char_cn)

trn_cn_bc = cn_pre.to_bcolz("/data/nlp/%s_cn"%(TASK),trn_cn,cn_pre.split_cn)

py_pre = nlp_prepro(seq_len=200,vocab=PY_VOCAB_SIZE,char2idx = char2idx_py,idx2char=idx2char_py)

trn_py_bc = py_pre.to_bcolz("/data/nlp/%s_py"%(TASK),trn_py,py_pre.split_py)

tk_pre = nlp_prepro(seq_len=150,vocab=TK_VOCAB_SIZE,char2idx = char2idx_tk,idx2char=idx2char_tk)

trn_tk_bc = tk_pre.to_bcolz("/data/nlp/%s_tk"%(TASK),trn_tk,tk_pre.split_tk)

100%|██████████| 3740/3740 [00:05<00:00, 629.30it/s]
100%|██████████| 3740/3740 [00:06<00:00, 605.08it/s]
100%|██████████| 3740/3740 [00:05<00:00, 628.80it/s]


In [178]:
trn_cn_bc

carray((3740, 150), int64)
  nbytes := 4.28 MB; cbytes := 602.74 KB; ratio: 7.27
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 13; chunksize: 15600; blocksize: 15600
  rootdir := '/data/nlp/hotel_cn'
  mode    := 'a'
[[1999 1999 1999 ..., 1115   32   32]
 [1999 1999 1999 ...,  272   26   75]
 [1999 1999 1999 ...,  879    5  222]
 ..., 
 [1999 1999 1999 ...,   93    1   15]
 [1999 1999 1999 ...,   18   43   39]
 [1999 1999 1999 ...,  101  220   71]]

In [116]:
print(trn_cn_bc.shape)
print(trn_py_bc.shape)
print(trn_tk_bc.shape)

(3740, 150)
(3740, 200)
(3740, 150)


In [117]:
val_cn_pre = nlp_prepro(seq_len=150,vocab=CN_VOCAB_SIZE,char2idx = char2idx_cn,idx2char=idx2char_cn)

val_cn_bc = val_cn_pre.to_bcolz("/data/nlp/val_%s_cn"%(TASK),val_cn.reset_index()[TASK],val_cn_pre.split_cn)

val_py_pre = nlp_prepro(seq_len=200,vocab=PY_VOCAB_SIZE,char2idx = char2idx_py,idx2char=idx2char_py)

val_py_bc = val_py_pre.to_bcolz("/data/nlp/val_%s_py"%(TASK),val_py.reset_index()[TASK+"_py"],val_py_pre.split_py)

val_tk_pre = nlp_prepro(seq_len=150,vocab=TK_VOCAB_SIZE,char2idx = char2idx_tk,idx2char=idx2char_tk)

val_tk_bc = val_tk_pre.to_bcolz("/data/nlp/val_%s_tk"%(TASK),val_tk.reset_index()["tk_list"],val_tk_pre.split_tk)

100%|██████████| 936/936 [00:01<00:00, 606.16it/s]
100%|██████████| 936/936 [00:01<00:00, 498.76it/s]
100%|██████████| 936/936 [00:01<00:00, 582.29it/s]


In [119]:
train_lbl = list(train_df[TAG_FIELD].apply(lambda x:0 if x=="GOOD" else 1))
valid_lbl = list(valid_df[TAG_FIELD].apply(lambda x:0 if x=="GOOD" else 1))

In [121]:
trn_y = bcolz.carray(np.eye(2)[train_lbl],rootdir = "/data/nlp/trn_y_%s"%TASK,dtype="int")

val_y = bcolz.carray(np.eye(2)[valid_lbl],rootdir = "/data/nlp/val_y_%s"%TASK,dtype="int")
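
`np.eye(2)[labels]` picks row `k` of the 2×2 identity matrix for each label `k`, which is a one-hot encoding. A dependency-free sketch of the same mapping, using a hypothetical helper named `one_hot`:

```python
def one_hot(labels, n_classes):
    # Row for label k has a 1 in column k and 0 elsewhere
    return [[1 if k == lbl else 0 for k in range(n_classes)] for lbl in labels]

statuses = ["GOOD", "EVIL", "EVIL"]
labels = [0 if s == "GOOD" else 1 for s in statuses]  # same rule as above
encoded = one_hot(labels, 2)
print(encoded)
```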

##  LSTM

In [124]:
from keras.layers import *
from keras.models import *

In [134]:
def txt_ipt(text_name,seq_len,vocab_size,fac_n):
    # return the input layer and the LSTM encoding of its embedded sequence
    ipt=Input((seq_len,),name="ipt_"+text_name)
    emb=Embedding(vocab_size,fac_n,name="emb_"+text_name)(ipt)
    # drop whole embedding channels across all time steps
    emb=SpatialDropout1D(rate=0.7)(emb)
    emb=LSTM(40,name="lstm_"+text_name)(emb)
    return ipt,emb


In [135]:
dr_rate=0.6
# Input & Embedding Layers
ipt,emb=txt_ipt(TASK,150,CHAR_SIZE,FACTOR_NB)
cn_ipt,cn_emb=txt_ipt(TASK+"_cn",150,CN_VOCAB_SIZE,FACTOR_NB)
py_ipt,py_emb=txt_ipt(TASK+"_py",200,PY_VOCAB_SIZE,FACTOR_NB)
# Concatenate the three LSTM encodings (token, character, pinyin)
d1=concatenate([emb,cn_emb,py_emb],axis=1)
d2=Dense(1024,activation="relu")(d1)
d2=Dropout(dr_rate)(d2)
d2=Dense(512,activation="relu")(d2)
d2=Dropout(dr_rate)(d2)
output=Dense(LABEL_LEN,activation="softmax",name="Output_Layer")(d2)

In [136]:
model=Model([ipt,cn_ipt,py_ipt],output)
model.compile(optimizer="Adam",loss="categorical_crossentropy",metrics=["accuracy"])

In [137]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
ipt_hotel (InputLayer)          (None, 150)          0                                            
__________________________________________________________________________________________________
ipt_hotel_cn (InputLayer)       (None, 150)          0                                            
__________________________________________________________________________________________________
ipt_hotel_py (InputLayer)       (None, 200)          0                                            
__________________________________________________________________________________________________
emb_hotel (Embedding)           (None, 150, 30)      90000       ipt_hotel[0][0]                  
__________________________________________________________________________________________________
emb_hotel_
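
The 90000 in the summary is the token embedding's parameter count: vocabulary size times embedding width, 3000 × 30. As a rough cross-check, and assuming the standard Keras layer definitions with biases enabled, the other layers can be counted the same way. The helper names below are illustrative and not part of the notebook.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

FACTOR_NB, LSTM_UNITS = 30, 40
emb_tk = embedding_params(3000, FACTOR_NB)   # token branch
emb_cn = embedding_params(2000, FACTOR_NB)   # character branch
emb_py = embedding_params(50, FACTOR_NB)     # pinyin branch
lstm_each = lstm_params(FACTOR_NB, LSTM_UNITS)
dense_1 = dense_params(3 * LSTM_UNITS, 1024)  # input is the 3 concatenated encodings
dense_2 = dense_params(1024, 512)
out = dense_params(512, 2)
total = emb_tk + emb_cn + emb_py + 3 * lstm_each + dense_1 + dense_2 + out
print(emb_tk, lstm_each, total)
```

Under these assumptions most of the weights sit in the two dense layers rather than in the embeddings or LSTMs.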

In [138]:
# from keras import callbacks

# SAVE_DIR="/data/logs"
# tb=callbacks.TensorBoard(log_dir=SAVE_DIR+TASK,batch_size=128,write_graph=True)

In [139]:
BATCH_SIZE = 128
class bcolz_gen:
    def __init__(self,tk_array,cn_array, py_array, y,batch_size=BATCH_SIZE):
        self.tk_array = tk_array
        self.cn_array = cn_array
        self.py_array = py_array
        self.y = y
        self.batch_size = batch_size
        self.idx=0
        self.n=int(len(self.tk_array)/self.batch_size)
        
    def __len__(self):
        return self.n

    def __iter__(self):
        return self

    def __next__(self):
        # slice the next batch from each array; inputs are ordered [tokens, characters, pinyin]
        self.idx_a = self.idx*self.batch_size
        self.idx_b = self.idx_a + self.batch_size
        rt = [self.tk_array[self.idx_a:self.idx_b],self.cn_array[self.idx_a:self.idx_b],self.py_array[self.idx_a:self.idx_b]],self.y[self.idx_a:self.idx_b]
        self.idx +=1
        # wrap around so the generator can be cycled for any number of epochs
        if self.idx == self.n: self.idx=0
        return rt
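
The generator walks through the arrays in fixed-size slices and wraps around after `n` batches. Because `n` is `int(len / batch_size)`, the remainder is never visited: with 3740 training rows and a batch size of 128 that is 29 batches and 28 skipped rows. A stripped-down sketch of the same cycling logic on plain lists (`BatchCycler` is a hypothetical name):

```python
class BatchCycler:
    def __init__(self, xs, ys, batch_size):
        self.xs, self.ys = xs, ys
        self.batch_size = batch_size
        self.n = len(xs) // batch_size  # full batches only; the remainder is skipped
        if self.n == 0:
            raise ValueError("need at least batch_size rows")
        self.idx = 0

    def __iter__(self):
        return self

    def __next__(self):
        a = self.idx * self.batch_size
        b = a + self.batch_size
        self.idx = (self.idx + 1) % self.n  # wrap around after the last full batch
        return self.xs[a:b], self.ys[a:b]

gen = BatchCycler(list(range(10)), list(range(10, 20)), batch_size=4)
first, second, third = next(gen), next(gen), next(gen)
print(first, second, third)
```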

In [140]:
trn_gen=bcolz_gen(tk_array=trn_tk_bc,cn_array=trn_cn_bc,py_array=trn_py_bc,y=trn_y)
val_gen=bcolz_gen(tk_array=val_tk_bc,cn_array=val_cn_bc,py_array=val_py_bc,y=val_y)

In [141]:
model.optimizer.lr=1e-3
model.fit_generator(trn_gen,steps_per_epoch=trn_gen.n,epochs=10,validation_data=val_gen,validation_steps=val_gen.n)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x138fd9978>