Pkuseg Toolkit

A multi-domain Chinese word segmentation toolkit.

Highlights

The pkuseg-python toolkit has the following features:

  1. Supporting multi-domain Chinese word segmentation. Pkuseg-python provides pre-trained models for several domains, including news, web, medicine, and tourism. Users can choose the model that best matches the domain of the text to be segmented. If the domain of the text is unknown, the default model, trained on mixed-domain data, is recommended.

  2. Higher word segmentation accuracy. Compared with existing word segmentation toolkits, pkuseg-python can achieve higher F1 scores on the same dataset.

  3. Supporting model training. Pkuseg-python also lets users train a new segmentation model on their own data.

  4. Supporting POS tagging. Pkuseg-python also provides part-of-speech (POS) tagging interfaces for further lexical analysis.
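To make the F1 comparison in item 2 concrete, here is a minimal sketch of how a word-level F1 score is commonly computed for segmentation output. It assumes the usual span-matching definition (a predicted word counts as correct only if both of its boundaries match the gold segmentation); the page does not state which evaluation script the authors used. The helper names `word_spans` and `segmentation_f1` and the example sentence are illustrative and are not part of pkuseg-python.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, predicted_words):
    """Return (precision, recall, f1) for one segmented sentence.

    Assumption: a predicted word is correct only when its exact
    character span also appears in the gold segmentation.
    """
    if "".join(gold_words) != "".join(predicted_words):
        raise ValueError("gold and predicted segmentations must cover the same text")
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits the last gold word.
gold = ["我", "爱", "北京", "天安门"]
predicted = ["我", "爱", "北京", "天安", "门"]
precision, recall, f1 = segmentation_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this example three of the five predicted words match gold spans, so a single wrong split lowers both precision and recall. Scores reported over a whole dataset would typically aggregate the counts across all sentences before computing F1.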

Authors

Ruixuan Luo, Jingjing Xu, Xuancheng Ren, Yi Zhang, Bingzhen Wei, Xu Sun

Jingjing Xu
NLP Ph.D. (fourth-year)

I am a PhD candidate, supervised by Prof. Xu Sun, at the MOE Key Laboratory of Computational Linguistics, School of Electronics Engineering and Computer Science, Peking University. I received my bachelor's degree from the College of Information Engineering at Northwest A&F University in 2015. My interests are in Natural Language Processing and Deep Learning. Currently, my research areas include knowledge-aware language understanding and generation, and adversarial attacks for robust machine learning.