unihan-tabular

unihan-tabular - tool to build UNIHAN into tabular-friendly formats like python, JSON, CSV and YAML. Part of the cihai project.

Python Package Documentation Status Build Status Code Coverage License

UNIHAN‘s data is dispersed across multiple files in the format of:

U+3400      kCantonese      jau1
U+3400      kDefinition     (same as U+4E18 ) hillock or mound
U+3400      kMandarin       qiū
U+3401      kCantonese      tim2
U+3401      kDefinition     to lick; to taste, a mat, bamboo bark
U+3401      kHanyuPinyin    10019.020:tiàn
U+3401      kMandarin       tiàn

$ unihan-tabular will download Unihan.zip and build all files into a single tabular friendly format.

CSV (default), $ unihan-tabular:

char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
,U+3400,jau1,(same as U+4E18 ) hillock or mound,,qiū
,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn

JSON, $ unihan-tabular -F json:

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kCantonese": "jau1",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kCantonese": "tim2",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]

YAML $ unihan-tabular -F yaml:

- char: 
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

Features

  • automatically downloads UNIHAN from the internet
  • export to JSON, CSV and YAML (requires pyyaml) via -F
  • configurable to export specific fields via -f
  • accounts for encoding conflicts due to the Unicode-heavy content
  • designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
  • core component and dependency of cihai, a CJK library
  • data package support
  • supports python 2.7, >= 3.5 and pypy

If you encounter a problem or have a question, please create an issue.

Usage

unihan-tabular supports command line arguments. See unihan-tabular CLI arguments for information on how you can specify custom columns, files, download URL’s and output destinations.

To download and build your own UNIHAN export:

$ pip install unihan-tabular

To output CSV, the default format:

$ unihan-tabular

To output JSON:

$ unihan-tabular -F json

To output YAML:

$ pip install pyyaml
$ unihan-tabular -F yaml

To only output the kDefinition field in a csv:

$ unihan-tabular -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-tabular -f kCantonese kDefinition

To output to a custom file:

$ unihan-tabular --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-tabular --destination ./exported.{ext}

See unihan-tabular CLI arguments for advanced usage examples.

Structure

# output w/ JSON
{XDG data dir}/unihan_tabular/unihan.json

# output w/ CSV
{XDG data dir}/unihan_tabular/unihan.csv

# output w/ yaml (requires pyyaml)
{XDG data dir}/unihan_tabular/unihan.yaml

# script to download + build a SDF csv of unihan.
unihan_tabular/process.py

# unit tests to verify behavior / consistency of builder
tests/*

# python 2/3 compatibility module
unihan_tabular/_compat.py

# utility / helper functions
unihan_tabular/util.py