RcppMeCab
RcppMeCab is an Rcpp wrapper for MeCab, a morphological analyzer and part-of-speech tagger. It handles UTF-8 natively in its C++ code and works with the CJK (Chinese, Japanese, and Korean) variants of the MeCab library. The analysis runs in compiled C++ code through Rcpp, which keeps text processing fast.
__Please see this for easy installation and usage examples in Korean.__
MeCab backends
RcppMeCab builds MeCab from source at install time. The MeCab variant is selected by the MECAB_LANG environment variable:
| MECAB_LANG | Backend | Version | Source |
|---|---|---|---|
| ko (default) | mecab-ko-msvc | 0.999 | Pusnow/mecab-ko-msvc |
| ja | MeCab | 0.996 | taku910/mecab |
On Linux and macOS, if MeCab is already installed system-wide (detected via mecab-config), RcppMeCab uses the system installation regardless of MECAB_LANG.
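The selection rules above can be sketched in a few lines of base R. This is only an illustration of the described behavior, not the package's actual configure logic: `guess_backend()` is a hypothetical helper, and returning `NA` for other `MECAB_LANG` values is an assumption, since the text does not say what happens in that case.

```r
# Hypothetical helper (not part of RcppMeCab): predicts which MeCab build
# the rules above would select. Defaults read the current environment.
guess_backend <- function(lang = Sys.getenv("MECAB_LANG"),
                          has_system_mecab = nzchar(Sys.which("mecab-config")),
                          is_windows = .Platform$OS.type == "windows") {
  # A system-wide MeCab wins on Linux and macOS, regardless of MECAB_LANG.
  if (has_system_mecab && !is_windows) {
    return("system MeCab")
  }
  # An unset or empty MECAB_LANG falls back to the Korean default.
  if (!nzchar(lang)) lang <- "ko"
  switch(lang,
    ko = "mecab-ko-msvc 0.999 (Pusnow/mecab-ko-msvc)",
    ja = "MeCab 0.996 (taku910/mecab)",
    NA_character_  # assumption: behavior for other values is not documented
  )
}

# What a source install would pick on this machine:
guess_backend()

# What it would pick with MECAB_LANG=ja and no system MeCab:
guess_backend(lang = "ja", has_system_mecab = FALSE)
```

Running `guess_backend()` before installing is a quick way to see whether a system MeCab will override the `MECAB_LANG` setting.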
Installation
Linux, macOS, and Windows
RcppMeCab automatically downloads and builds MeCab from source if it is not already installed on your system. No manual MeCab installation is required.
install.packages("RcppMeCab") # install from CRAN
# or install the development version
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab")
If you already have MeCab installed (e.g. via brew install mecab on macOS, or apt install libmecab-dev on Linux), RcppMeCab will use your system installation.
Language selection
Set MECAB_LANG before installation to choose the MeCab variant:
# Korean (default)
install.packages("RcppMeCab", type = "source")
# Japanese
Sys.setenv(MECAB_LANG = "ja")
install.packages("RcppMeCab", type = "source")
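`Sys.setenv()` changes the variable for the rest of the R session, so a later reinstall would silently reuse it. One possible way to scope the setting to a single install is sketched below; `with_mecab_lang()` is a hypothetical helper written for this example, not a function exported by RcppMeCab.

```r
# Hypothetical helper (not part of RcppMeCab): evaluate `expr` with
# MECAB_LANG set to `lang`, then restore the previous value.
with_mecab_lang <- function(lang, expr) {
  old <- Sys.getenv("MECAB_LANG", unset = NA)
  on.exit(
    if (is.na(old)) Sys.unsetenv("MECAB_LANG") else Sys.setenv(MECAB_LANG = old),
    add = TRUE
  )
  Sys.setenv(MECAB_LANG = lang)
  force(expr)  # `expr` is evaluated lazily, so it sees the new value
}

# Stand-in for the install call: report the value a child process would inherit.
seen <- with_mecab_lang("ja", Sys.getenv("MECAB_LANG"))
seen

# Real use (commented out because it reinstalls the package):
# with_mecab_lang("ja", install.packages("RcppMeCab", type = "source"))
```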
Dictionary
A MeCab dictionary is automatically downloaded and installed during package installation:
- Korean (MECAB_LANG=ko, default): mecab-ko-dic (pre-compiled, from mecab-ko-msvc releases)
- Japanese (MECAB_LANG=ja): IPAdic (compiled from source during installation)
The bundled dictionary is stored in the package's dic/ directory and used automatically — no manual dictionary setup is required.
Downloading additional dictionaries
You can download and install dictionaries for other languages after installation using download_dic(). No system-level MeCab installation is required — dictionary compilation is handled entirely within R.
download_dic("ja") # download and compile Japanese IPAdic
download_dic("ko") # download Korean mecab-ko-dic
download_dic("zh") # download and compile Chinese mecab-jieba
Dictionaries are stored in the user data directory (tools::R_user_dir("RcppMeCab", "data")) and persist across R sessions.
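To find that directory on a given machine, the same base R call can be run directly. The sketch below assumes each downloaded dictionary sits in a subdirectory of that root, which is inferred from the paths in this README rather than confirmed from the package source. `tools::R_user_dir()` requires R 4.0.0 or later and honors the `R_USER_DATA_DIR` environment variable.

```r
# Root of the per-user dictionary store named above.
dic_root <- tools::R_user_dir("RcppMeCab", which = "data")
dic_root

# Subdirectories present so far (empty before any download).
installed <- if (dir.exists(dic_root)) {
  list.dirs(dic_root, full.names = FALSE, recursive = FALSE)
} else {
  character(0)
}
installed
```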
Use list_dic() to see all installed dictionaries:
list_dic()
#>      lang         name                          path active
#> 1 bundled      bundled        /path/to/RcppMeCab/dic   TRUE
#> 2      ja       ipadic ~/.local/share/R/RcppMeCab/ja  FALSE
#> 3      ko mecab-ko-dic ~/.local/share/R/RcppMeCab/ko  FALSE
#> 4      zh  mecab-jieba ~/.local/share/R/RcppMeCab/zh  FALSE
Usage
The package provides two tagging functions, pos and posParallel:
pos(sentence) # returns a list
pos(sentence, join = FALSE) # morphemes only (tags as vector names)
pos(sentence, format = "data.frame") # returns a data frame
pos(sentence, user_dic = "path") # with a compiled user dictionary
posParallel(sentence) # parallelized, faster for large inputs
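The two output shapes carry the same information, so a joined result can be converted into the named-vector shape afterwards. The sketch below shows one way to do that in base R. The token strings are made-up placeholders rather than real MeCab output, and the code assumes that a tag never contains a slash.

```r
# Placeholder tokens in the joined morpheme/tag shape; these are invented
# for illustration and are not real MeCab output.
joined <- c("morphA/TAG1", "morphB/TAG2", "a/b/TAG3")

# Split at the last slash so a morpheme that itself contains "/" survives.
morphemes <- sub("/[^/]*$", "", joined)
tags <- sub("^.*/", "", joined)

# Same information in the join = FALSE shape: morphemes named by their tags.
stats::setNames(morphemes, tags)
```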
Switching languages
Use the lang parameter to select a dictionary by language:
pos("東京は日本の首都です。", lang = "ja")
pos("안녕하세요", lang = "ko")
pos("我是中国人。", lang = "zh")
Or set a default with set_dic():
set_dic("ja")
pos("東京は日本の首都です。") # uses Japanese dictionary
set_dic("ko")
pos("안녕하세요") # uses Korean dictionary
set_dic("bundled") # switch back to the build-time dictionary
You can also specify a custom dictionary path directly:
pos("text", sys_dic = "/path/to/custom-dic") # for a single call
options(mecabSysDic = "/path/to/custom-dic") # as the default for later calls
Parameters
- sentence: text to analyze
- join: if TRUE (default), output is morpheme/tag; if FALSE, output is morpheme with tag as attribute
- format: "list" (default) or "data.frame"
- lang: language code ("ja", "ko", or "zh") to select a dictionary installed via download_dic(). Overrides sys_dic when specified.
- sys_dic: directory containing dicrc, sys.dic, etc. Set a default with options(mecabSysDic = "/path/to/dic")
- user_dic: path to a user dictionary compiled by dict_index()
Note: provide full paths for sys_dic and user_dic (no tilde ~/ expansion).
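Because `~` is not expanded, it may help to expand paths in R before passing them on and to check that a directory looks like a MeCab dictionary. A minimal sketch follows. Both helpers are hypothetical and not part of RcppMeCab, and the check covers only the two files named in the parameter list (`dicrc` and `sys.dic`), not everything MeCab may need.

```r
# Hypothetical helpers (not part of RcppMeCab).

# Expand a leading "~"; existing paths are also resolved to absolute form.
full_path <- function(path) {
  normalizePath(path.expand(path), winslash = "/", mustWork = FALSE)
}

# Check for the two files the parameter list names explicitly.
looks_like_sys_dic <- function(dir) {
  all(file.exists(file.path(dir, c("dicrc", "sys.dic"))))
}

dic_dir <- full_path("~/dictionaries/custom-dic")  # hypothetical location
dic_dir
looks_like_sys_dic(dic_dir)
```

The expanded `dic_dir` can then be passed as `sys_dic`, and the same `full_path()` call works for a `user_dic` file.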
Compiling a user dictionary
RcppMeCab provides the dict_index() function to compile user dictionaries directly from R, without needing the mecab-dict-index command-line tool.
Prepare your entries as a CSV file (Japanese format, Korean format), then compile:
dict_index(
  dic_csv = "entries.csv",
  out_dic = "/path/to/userdic.dic",
  dic_dir = "/path/to/mecab-dic"
)
# Then use the compiled dictionary (full path, no ~/ expansion):
pos("some text", user_dic = "/path/to/userdic.dic")
Authors
Junhewk Kim, Taku Kudo
Contributors
Akiru Kato, Patrick Schratz