After the text dataset has been preprocessed, the rest of the workflow is similar to other machine learning tasks: feature engineering, then building and fitting a model. This article introduces two approaches.
Bag of Words (BOW)
The idea of BOW is to first record every word that has ever appeared (the vocabulary), and then convert each sentence into a text vector whose entries are the frequency or count of each word in the text.
Example
Given the following two sentences:
Text 1: I love machine learning
Text 2: I love programming
Build the vocabulary: ['I', 'love', 'machine', 'learning', 'programming']
The two texts can then be represented as:
Vector for text 1: [1, 1, 1, 1, 0]
Vector for text 2: [1, 1, 0, 0, 1]
There are many variants, such as the N-gram model (which mainly adds word-order/syntactic information).
The biggest strength of BOW is its simplicity, and it captures word-frequency information well; its weakness is that it ignores interactions between words (the values inside the vector have no relationship to one another).
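As a quick, hedged illustration of the N-gram variant mentioned above (this snippet is my own addition, not part of the original workflow), scikit-learn's CountVectorizer can build unigram-plus-bigram features via the ngram_range argument, so adjacent word pairs become features and a little word-order information is kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love machine learning", "I love programming"]

# ngram_range=(1, 2): keep single words and adjacent word pairs as features
bigram_vectorizer = CountVectorizer(analyzer = "word", ngram_range = (1, 2))
features = bigram_vectorizer.fit_transform(texts)

print(bigram_vectorizer.get_feature_names_out())  # includes bigrams such as 'machine learning'
print(features.toarray())
```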
Workflow
- Data preprocessing
def review_to_words(): 1. apply the usual cleaning steps; 2. return a single string (a sketch follows below).
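The notes only outline review_to_words, so the following is a minimal sketch of what such a cleaning function might look like; the concrete steps (HTML stripping, keeping letters only, lowercasing, stopword removal) are my assumptions based on the usual movie-review cleanup, not the author's exact code:

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_words(raw_review):
    # 1. Remove HTML markup
    text = BeautifulSoup(raw_review).get_text()
    # 2. Keep letters only
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # 3. Lowercase and split into words
    words = letters_only.lower().split()
    # 4. Remove English stopwords
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    # 5. Join back into one space-separated string for CountVectorizer
    return " ".join(meaningful_words)
```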
- BOW
```python
from sklearn.feature_extraction.text import CountVectorizer

# initialize
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)  # number of features to keep

train_data_features = vectorizer.fit_transform(clean_train_reviews)
print(train_data_features)

# transfer into np form
train_data_features = train_data_features.toarray()
print(train_data_features)
```
------ The feature format is not finalized yet; hoping for better results ------
stop_words is left as None because the reviews were already cleaned in the earlier step; I trust my own manual preprocessing more than the package's built-in handling.
- Inspect the results
```python
import numpy as np

train_data_features.shape  # (number of reviews, number of vocabulary features)

# vocabulary
vocab = vectorizer.get_feature_names_out()
print(vocab)

# word counts across the whole training set
dist = np.sum(train_data_features, axis=0)
print(dist)
```
- Build the model (RandomForest as an example)
```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data_features, train["sentiment"])
```
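To close the loop on the BOW pipeline, here is a hedged sketch of scoring the test set; it assumes a list clean_test_reviews (a hypothetical name) that was cleaned with the same review_to_words routine, and it reuses the already-fitted vectorizer so train and test share one vocabulary:

```python
# Transform the test reviews with the vectorizer fitted on the training data
test_data_features = vectorizer.transform(clean_test_reviews).toarray()

# Predict sentiment labels with the trained forest
result = forest.predict(test_data_features)
```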
Word2Vec
Word2Vec is meant to overcome the assumption that the words in a text are independent of one another. There are two main training schemes: Skip-Gram and CBOW. The former predicts the context words from the target word, while the latter predicts the target word from its context. The main limitation is the fixed window (a given context window length).
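In gensim's implementation the choice between the two schemes is controlled by the sg parameter (sg=0 selects CBOW, the default; sg=1 selects Skip-Gram). A minimal illustration, assuming sentences is already a list of tokenized sentences:

```python
from gensim.models import Word2Vec

# CBOW (default): predict the target word from its context
cbow_model = Word2Vec(sentences, vector_size = 300, window = 10, sg = 0)

# Skip-Gram: predict the context words from the target word
skipgram_model = Word2Vec(sentences, vector_size = 300, window = 10, sg = 1)
```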
Result
"dog" → [0.23, 0.51, -0.11, ..., 0.75]
"cat" → [0.21, 0.48, -0.13, ..., 0.78]
"king" → [0.45, 0.62, -0.09, ..., 0.50]
The numbers inside are latent features; like the values obtained after PCA dimensionality reduction, they are not individually interpretable. They do have a geometric interpretation, though: the similarity between two words can be measured as the similarity of their vectors (cosine similarity).
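As a small worked illustration of that geometric view (the vectors below are made up for the example, not real Word2Vec output), cosine similarity is simply the dot product of two vectors divided by the product of their norms:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|), in the range [-1, 1]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog  = np.array([0.23, 0.51, -0.11, 0.75])
cat  = np.array([0.21, 0.48, -0.13, 0.78])
king = np.array([0.45, 0.62, -0.09, 0.50])

print(cosine_similarity(dog, cat))   # close to 1: "dog" and "cat" point in similar directions
print(cosine_similarity(dog, king))  # noticeably smaller
```

With a trained gensim model the same quantity is available directly as model.wv.similarity("dog", "cat").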
Workflow
- Data preprocessing + splitting each review (paragraph) into sentences
```python
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_wordlist(review, remove_stopwords = False):
    # 1. Remove HTML markup
    review_text = BeautifulSoup(review).get_text()
    # 2. Remove non-letters
    review_only = re.sub("[^a-zA-Z]", " ", review_text)
    # 3. Lowercase and split into words
    words = review_only.lower().split()
    # 4. For Word2Vec, stopwords and numbers are usually kept; removal is optional
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    # 5. Return a list of words
    return words

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def review_to_sentences(review, tokenizer, remove_stopwords = False):
    # 1. Split a review (paragraph) into n sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    # 2. Turn each sentence into a list of words
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
    return sentences
```
After this step, the result should be a list of sentences, where each sentence is itself a list with every word as a separate element.
- Process the data
```python
sentences = []

for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)
```
- Word2Vec
```python
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# setting parameters
num_features = 300      # word vector dimensionality
min_word_count = 40     # minimum word count to keep a word
num_workers = 1         # number of worker threads
context = 10            # context window size
downsampling = 1e-3     # downsampling of frequent words

# Word2Vec
from gensim.models import word2vec
model = word2vec.Word2Vec(sentences,
                          workers = num_workers,
                          vector_size = num_features,
                          min_count = min_word_count,
                          window = context,
                          sample = downsampling)

model.init_sims(replace=True)

model_name = "300features_40minwords_10context"
model.save(model_name)
```
- Inspect the results (three ways)
```python
model.wv.doesnt_match("man woman child kitchen".split())
model.wv.most_similar("man")
model.wv.most_similar("awful")
```
Combining with RandomForest
1. Vector averaging
Word vector for "dog": [1, 2, 3]
Word vector for "cat": [4, 5, 6]
Word vector for "pet": [7, 8, 9]
The element-wise average of these vectors is [4, 5, 6].
Then, for each review, average the vectors of all its words and feed the averaged vectors into the RF; sketches of this averaging approach and of the clustering approach below follow after the list.
2. Clustering
Vocabulary: {dog, cat, apple, car, tree, run}
dog → [0.2, 0.4, -0.5, ...]
cat → [0.1, 0.3, -0.4, ...]
apple → [0.5, 0.7, -0.3, ...]
Cluster 1: {dog, cat}
Cluster 2: {apple, tree}
Cluster 3: {car, run}
Next, build a feature vector for each text giving the share of its words that falls into each cluster, i.e.:
Text 1: [0.2, 0.5, 0.3] (20% of its words belong to cluster 1, 50% to cluster 2, 30% to cluster 3)
Text 2: [0.1, 0.6, 0.3] (10% of its words belong to cluster 1, 60% to cluster 2, 30% to cluster 3)
Use these feature vectors as the training data and the sentiment labels as the targets, then feed them into a random forest for training (see the sketches below).
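Both feature constructions above are only described in words, so here are minimal sketches of each. They assume the trained gensim model from the previous step, plus a hypothetical clean_train_reviews (a list of word lists, one per review, e.g. built with review_to_wordlist); the helper names make_feature_vec and create_bag_of_centroids and the use of scikit-learn's KMeans are my own illustration rather than the original tutorial code.

Vector averaging:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_feature_vec(words, model, num_features):
    # Average the word vectors of all in-vocabulary words of one review
    feature_vec = np.zeros((num_features,), dtype="float32")
    nwords = 0
    index2word_set = set(model.wv.index_to_key)  # words known to the model
    for word in words:
        if word in index2word_set:
            nwords += 1
            feature_vec = np.add(feature_vec, model.wv[word])
    if nwords > 0:
        feature_vec = np.divide(feature_vec, nwords)
    return feature_vec

train_vecs = np.array([make_feature_vec(words, model, num_features)
                       for words in clean_train_reviews])
forest = RandomForestClassifier(n_estimators = 100).fit(train_vecs, train["sentiment"])
```

Clustering (bag of centroids):

```python
from sklearn.cluster import KMeans

# 1. Cluster all word vectors, e.g. roughly 5 words per cluster
word_vectors = model.wv.vectors
num_clusters = word_vectors.shape[0] // 5
kmeans = KMeans(n_clusters = num_clusters).fit(word_vectors)

# 2. Map every vocabulary word to its cluster index
word_centroid_map = dict(zip(model.wv.index_to_key, kmeans.labels_))

def create_bag_of_centroids(words, word_centroid_map, num_clusters):
    # Share of the review's words that falls into each cluster
    bag = np.zeros(num_clusters, dtype="float32")
    for word in words:
        if word in word_centroid_map:
            bag[word_centroid_map[word]] += 1
    if bag.sum() > 0:
        bag = bag / bag.sum()
    return bag

train_centroid_features = np.array(
    [create_bag_of_centroids(words, word_centroid_map, num_clusters)
     for words in clean_train_reviews])
forest = RandomForestClassifier(n_estimators = 100).fit(train_centroid_features,
                                                        train["sentiment"])
```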