Building a Japanese NLP Pipeline with MeCab and spaCy

Processing Japanese text requires tools that understand the language’s unique characteristics: no spaces between words, agglutinative morphology, and extensive use of particles, conjugations, and mixed scripts (kanji, hiragana, katakana, and Latin). This article walks through how to build a robust Japanese NLP pipeline using MeCab, a fast and accurate morphological analyzer, together with spaCy, a modern NLP framework. It covers installation, integration, tokenization and part-of-speech tagging, lemmatization, custom dictionaries, combining statistical and rule-based methods, downstream tasks (POS tagging, dependency parsing, named entity recognition, text classification), performance considerations, and deployment tips.
Why MeCab + spaCy?
- MeCab is a mature, high-performance Japanese morphological analyzer: it segments text into morphemes and provides morphological features such as POS, base forms, and readings.
- spaCy is a fast, production-ready NLP library with a consistent API: it provides pipeline orchestration, model training, deployment support, and tools for downstream tasks.
- Combining MeCab’s language-specific strengths with spaCy’s ecosystem yields a practical, high-performance pipeline tailored for Japanese NLP.
1. Overview of the Pipeline
A typical pipeline using MeCab and spaCy:
- Text input (raw Japanese)
- Preprocessing (normalization: Unicode NFKC, full-width/half-width handling, punctuation; see the sketch after this list)
- Tokenization & morphological analysis with MeCab (surface form, POS, base form, reading)
- Convert MeCab outputs into spaCy-compatible Doc objects (tokens with attributes)
- Apply spaCy components: tagger, parser, NER, lemmatizer, custom components
- Optional: custom dictionaries, domain-specific rules, embedding layers, fine-tuned models
- Downstream tasks: classification, information extraction, search indexing, summarization
- Deployment (REST API, batch processing, microservices)
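As a minimal sketch of the normalization step above: Unicode NFKC folds full-width Latin letters and digits and half-width katakana into canonical forms, which keeps downstream tokenization consistent.

import unicodedata

def normalize(text: str) -> str:
    # NFKC: full-width "ＡＢＣ１２３" -> "ABC123", half-width "ｶﾀｶﾅ" -> "カタカナ"
    return unicodedata.normalize("NFKC", text)

print(normalize("ＡＢＣ１２３ ｶﾀｶﾅ"))  # -> ABC123 カタカナ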
2. Installation
Environment assumptions: Linux or macOS (Windows possible via WSL), Python 3.8+.
- Install MeCab (system library) and a dictionary (IPAdic or UniDic). UniDic offers richer morphological info; IPAdic is widely used.
- On Ubuntu:
sudo apt update
sudo apt install mecab libmecab-dev mecab-ipadic-utf8
- On macOS (Homebrew):
brew install mecab mecab-ipadic
- Install Python bindings and spaCy.
pip install mecab-python3 fugashi[unidic-lite] spacy sudachipy sudachidict_core
Notes:
- mecab-python3 provides direct MeCab bindings.
- fugashi is a modern wrapper compatible with spaCy integrations (often used with unidic-lite).
- You may prefer UniDic for improved analysis; install unidic-lite or unidic and point fugashi to it (see the snippet at the end of this section).
- Install a spaCy Japanese model. As of 2025, spaCy supports Japanese via third-party models like GiNZA or via its built-in Japanese tokenizer (backed by SudachiPy); MeCab/fugashi can be wired in through a custom tokenizer, as shown later in this article.
Example with GiNZA (GiNZA models are distributed as ordinary pip packages, not via python -m spacy download):
pip install -U ginza ja_ginza
Or, for the transformer-based model used in the examples below:
pip install -U ginza ja_ginza_electra
(spaCy also ships official Japanese models, e.g. python -m spacy download ja_core_news_sm, if you prefer to stay within the spaCy model line.)
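If you want the full UniDic rather than unidic-lite (per the note above), one option, assuming the unidic PyPI package, is to run pip install unidic followed by python -m unidic download, then point fugashi at the downloaded data. A minimal sketch:

import unidic
from fugashi import Tagger

# unidic.DICDIR is the directory created by `python -m unidic download`
print(unidic.DICDIR)
tagger = Tagger(f"-d {unidic.DICDIR}")  # explicit; fugashi can also auto-detect installed UniDic variants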
3. Tokenization and Morphological Analysis
MeCab outputs token surface, POS, base form (lemma), and reading. Use fugashi for easy Python integration:
from fugashi import Tagger

tagger = Tagger()  # uses the default dictionary (unidic-lite if installed)
text = "今日は良い天気ですね。"
for t in tagger(text):
    print(t.surface, t.feature.pos1, t.feature.lemma, t.feature.lForm)  # surface, POS, lemma, reading
Converting MeCab tokens into spaCy Doc objects lets you use spaCy components. Use the spacy.tokens.Doc class and set token attributes such as .lemma_, .pos_, and .tag_.
4. Integrating MeCab with spaCy
Option A — Use GiNZA (ja_ginza or ja_ginza_electra), which bundles Japanese morphological analysis with a spaCy-ready pipeline. This is the simplest route:
import spacy

nlp = spacy.load("ja_ginza_electra")
doc = nlp("今日は良い天気ですね。")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_)
Option B — Custom pipeline: run MeCab/fugashi first, then construct a spaCy Doc with attributes. Example:
import spacy
from spacy.tokens import Doc
from fugashi import Tagger

nlp = spacy.blank("ja")
tagger = Tagger()

def mecab_to_doc(nlp, text):
    morphemes = list(tagger(text))
    words = [m.surface for m in morphemes]
    # Japanese text has no spaces between tokens, so mark every token as space-free.
    doc = Doc(nlp.vocab, words=words, spaces=[False] * len(words))
    # Optionally set morphological attributes from the Tagger features.
    for token, m in zip(doc, morphemes):
        token.tag_ = m.feature.pos1  # coarse POS
        token.lemma_ = getattr(m.feature, "lemma", None) or token.text
    return doc

doc = mecab_to_doc(nlp, "今日は良い天気ですね。")
This approach gives full control, but it requires mapping MeCab features to spaCy token attributes and ensuring that downstream pipeline components expect those attributes.
5. Lemmatization & Base Forms
Japanese verbs and adjectives conjugate heavily; MeCab provides base forms (dictionary forms). Use those for lemmatization:
- MeCab’s feature fields include dictionary form/reconstructed lemma. Map that into token.lemma_.
- For nouns and loanwords, surface form may equal lemma.
Example mapping with fugashi/unidic features:
for w in tagger("食べました"):
    print(w.surface, w.feature.lemma)  # -> 食べる
6. Custom Dictionaries & Domain Adaptation
- MeCab supports user dictionaries to add domain-specific words (product names, jargon, named entities) to improve tokenization.
- Create user dictionary CSVs, compile them with mecab-dict-index, and load the result with -u path/to/user.dic (see the example after this list).
- For fugashi/mecab-python3, pass dictionary options when initializing Tagger:
tagger = Tagger("-d /usr/local/lib/mecab/dic/unidic -u /path/to/user.dic")
- Test with ambiguous compounds and named entities; adjust dictionary entries for surface, reading, base form, and POS.
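For example, an IPAdic-format user entry and its compilation look roughly like the following (the entry is a made-up product name, and the paths are illustrative; check mecab-config --libexecdir and mecab-config --dicdir for your installation). Leaving the left/right context IDs empty lets mecab-dict-index assign them automatically.

# user_entries.csv: surface,left-id,right-id,cost,POS fields,base form,reading,pronunciation
ネオスマイルフォン,,,10,名詞,固有名詞,一般,*,*,*,ネオスマイルフォン,ネオスマイルフォン,ネオスマイルフォン

/usr/lib/mecab/mecab-dict-index -d /var/lib/mecab/dic/ipadic-utf8 \
  -u user.dic -f utf-8 -t utf-8 user_entries.csv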
7. Named Entity Recognition (NER)
Options:
- Use GiNZA or ja_ginza models which include NER trained on UD/NE corpora and are spaCy-compatible.
- Train a custom spaCy NER using your labelled data. Convert MeCab tokenization into spaCy Docs with entity spans and train with spaCy’s training API.
- Use rule-based NER for high-precision patterns (regex, token sequences) as a pre- or post-processing step.
Example: combining rule-based and statistical NER
- Run MeCab to segment.
- Apply regex/lookup for product codes, acronyms.
- Pass doc into spaCy NER for person/location/org detection.
- Merge or prioritize results by confidence/heuristics.
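A sketch of this hybrid setup using spaCy's EntityRuler (the product code, company name, and sentence are made up; assumes the GiNZA pipeline installed earlier):

import spacy

nlp = spacy.load("ja_ginza_electra")

# Rule-based layer: an EntityRuler inserted before the statistical NER.
# Phrase patterns are tokenized by the pipeline itself, so they match
# regardless of how the tokenizer segments the text.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "AB-1234"},          # hypothetical product code
    {"label": "ORG", "pattern": "ネオスマイル株式会社"},  # hypothetical company name
])

doc = nlp("ネオスマイル株式会社は新製品 AB-1234 を東京で発表した。")
for ent in doc.ents:
    print(ent.text, ent.label_)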
8. Dependency Parsing and Syntax
spaCy models like GiNZA provide dependency parsing tuned for Japanese, but Japanese has flexible word order and topic-prominent constructions. Consider:
- Using UD-style dependency annotations (GiNZA, ja_ginza) for interoperability.
- Training or fine-tuning parsers with domain-specific treebanks if accuracy is critical.
- Using chunking or phrase-level analysis when full dependency parsing is noisy.
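For instance, the UD-style dependency labels and heads produced by a GiNZA pipeline are available directly on each token (assumes ja_ginza_electra is installed):

import spacy

nlp = spacy.load("ja_ginza_electra")
doc = nlp("昨日、友達と京都の寺を見に行きました。")
for token in doc:
    print(token.text, token.dep_, token.head.text)  # UD dependency label and syntactic head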
9. Text Classification & Embeddings
- For classification (sentiment, topic), represent text via:
- MeCab tokenized words + bag-of-words / TF-IDF
- Word/subword embeddings (word2vec trained on MeCab tokens)
- Contextual embeddings: fine-tune Japanese transformer models (e.g., cl-tohoku/bert-base-japanese, Japanese Electra) using spaCy’s transformer integration or Hugging Face.
- Example pipeline: MeCab tokenization → map tokens to embeddings → average/pool → classifier (logistic regression, SVM, or neural network).
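A minimal sketch of the bag-of-words variant of that pipeline, assuming scikit-learn is installed (the texts and labels are toy data):

from fugashi import Tagger
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tagger = Tagger()

def mecab_tokenize(text):
    # Use MeCab base forms so conjugated variants map to the same feature.
    return [t.feature.lemma or t.surface for t in tagger(text)]

texts = ["この映画はとても面白かった", "サービスが最悪だった", "料理が美味しくて満足した", "二度と行きたくない"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

clf = make_pipeline(
    TfidfVectorizer(tokenizer=mecab_tokenize, token_pattern=None),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["とても満足した"]))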
10. Performance Considerations
- MeCab is fast; use compiled user dictionaries and avoid repeated re-initialization of Tagger in tight loops.
- For high throughput, run Tagger in a persistent worker (Uvicorn/Gunicorn async workers) or use multiprocessing.
- Combine MeCab’s speed with spaCy’s optimized Cython operations by converting to spaCy Doc once per text and using spaCy pipelines for heavier tasks.
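For example, a sketch of the "initialize once, batch everything" pattern (the batch size and workload are placeholders):

import spacy
from fugashi import Tagger

# Create the analyzer and pipeline once, at startup, not inside the request loop.
nlp = spacy.load("ja_ginza_electra")
tagger = Tagger()

texts = ["今日は良い天気ですね。"] * 1000  # stand-in for a real workload

# Stream texts through the pipeline in batches instead of calling nlp(text) one at a time.
for doc in nlp.pipe(texts, batch_size=64):
    lemmas = [t.lemma_ for t in doc]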
11. Evaluation & Debugging
- Evaluate tokenization accuracy by comparing MeCab output to gold-standard segmented corpora.
- Use confusion matrices for POS, precision/recall for NER, LAS/UAS for parsing.
- Inspect failure cases: unknown words, merged compounds, incorrect lemma. Update user dictionary or retrain models.
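A small sketch of scoring segmentation against a gold standard, using precision/recall over character spans of token boundaries (the example tokens are illustrative):

def spans(tokens):
    # Convert a token list into a set of (start, end) character spans.
    out, pos = set(), 0
    for t in tokens:
        out.add((pos, pos + len(t)))
        pos += len(t)
    return out

def segmentation_f1(pred_tokens, gold_tokens):
    pred, gold = spans(pred_tokens), spans(gold_tokens)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# e.g. analyzer output vs. a gold segmentation of 外国人参政権
print(segmentation_f1(["外国", "人", "参政", "権"], ["外国人", "参政権"]))  # -> 0.0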
12. Deployment Tips
- Containerize the pipeline (Docker) with explicit versions of MeCab, dictionaries, Python packages.
- Expose an inference API for tokenization/analysis; batch requests for throughput.
- Monitor latency and memory; cache compiled dictionaries and spaCy models in memory.
- Consider quantized or distilled models for transformer components in latency-sensitive environments.
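One way to expose such an API (a sketch assuming FastAPI and uvicorn are installed; the endpoint name and payload shape are illustrative):

# api.py (run with: uvicorn api:app)
from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("ja_ginza_electra")  # loaded once at startup and kept in memory

@app.post("/analyze")
def analyze(payload: dict):
    doc = nlp(payload["text"])
    return {
        "tokens": [{"text": t.text, "lemma": t.lemma_, "pos": t.pos_} for t in doc],
        "entities": [{"text": e.text, "label": e.label_} for e in doc.ents],
    }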
13. Example End-to-End Script
# example_japanese_pipeline.py
import spacy
from fugashi import Tagger
from spacy.tokens import Doc

nlp = spacy.load("ja_ginza_electra")  # or spacy.blank("ja") + custom components
tagger = Tagger()

def to_spacy_doc(nlp, text):
    # Build a Doc from MeCab/fugashi tokens and copy morphological attributes onto it.
    morphemes = list(tagger(text))
    words = [m.surface for m in morphemes]
    doc = Doc(nlp.vocab, words=words, spaces=[False] * len(words))
    for token, m in zip(doc, morphemes):
        token.lemma_ = getattr(m.feature, "lemma", None) or token.text
        token.tag_ = m.feature.pos1
    return doc

if __name__ == "__main__":
    text = "国会では新しい法案が議論されています。"
    # Entities from the full GiNZA pipeline (its own tokenizer, parser, and NER)...
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # ...and lemma/POS from the MeCab-backed Doc built above.
    for token in to_spacy_doc(nlp, text):
        print(token.text, token.lemma_, token.tag_)
14. Further Reading & Resources
- MeCab documentation and dictionary guides
- GiNZA/ja_ginza spaCy model docs
- UniDic vs IPAdic comparison notes
- Japanese corpus resources: Kyoto University Text Corpus, Balanced Corpus of Contemporary Written Japanese (BCCWJ)
- Hugging Face Japanese transformer models
Building a Japanese NLP pipeline with MeCab and spaCy gives you precise tokenization and a modern, trainable pipeline for downstream tasks. Start simple (tokenize + lemma + NER) and incrementally add custom dictionaries, training data, and transformer components as your needs grow.