API
Build Unihan into a tabular-friendly format and export it.
- unihan_tabular.process.ALLOWED_EXPORT_TYPES = [u'json', u'csv', u'yaml']
  Allowed export types.
- unihan_tabular.process.DESTINATION_DIR = u'/home/docs/.local/share/unihan_tabular'
  File path to output the built CSV file to.
- unihan_tabular.process.INDEX_FIELDS = [u'ucn', u'char']
  Default index fields for Unihan CSVs. You probably want these.
- unihan_tabular.process.UNIHAN_FIELDS = [u'kAccountingNumeric', u'kBigFive', u'kCCCII', u'kCNS1986', u'kCNS1992', u'kCangjie', u'kCantonese', u'kCheungBauer', u'kCheungBauerIndex', u'kCihaiT', u'kCompatibilityVariant', u'kCowles', u'kDaeJaweon', u'kDefinition', u'kEACC', u'kFenn', u'kFennIndex', u'kFourCornerCode', u'kFrequency', u'kGB0', u'kGB1', u'kGB3', u'kGB5', u'kGB7', u'kGB8', u'kGSR', u'kGradeLevel', u'kHDZRadBreak', u'kHKGlyph', u'kHKSCS', u'kHanYu', u'kHangul', u'kHanyuPinlu', u'kHanyuPinyin', u'kIBMJapan', u'kIICore', u'kIRGDaeJaweon', u'kIRGDaiKanwaZiten', u'kIRGHanyuDaZidian', u'kIRGKangXi', u'kIRG_GSource', u'kIRG_HSource', u'kIRG_JSource', u'kIRG_KPSource', u'kIRG_KSource', u'kIRG_MSource', u'kIRG_TSource', u'kIRG_USource', u'kIRG_VSource', u'kJIS0213', u'kJapaneseKun', u'kJapaneseOn', u'kJis0', u'kJis1', u'kKPS0', u'kKPS1', u'kKSC0', u'kKSC1', u'kKangXi', u'kKarlgren', u'kKorean', u'kLau', u'kMainlandTelegraph', u'kMandarin', u'kMatthews', u'kMeyerWempe', u'kMorohashi', u'kNelson', u'kOtherNumeric', u'kPhonetic', u'kPrimaryNumeric', u'kPseudoGB1', u'kRSAdobe_Japan1_6', u'kRSJapanese', u'kRSKanWa', u'kRSKangXi', u'kRSKorean', u'kRSUnicode', u'kSBGY', u'kSemanticVariant', u'kSimplifiedVariant', u'kSpecializedSemanticVariant', u'kTaiwanTelegraph', u'kTang', u'kTotalStrokes', u'kTraditionalVariant', u'kVietnamese', u'kXHC1983', u'kXerox', u'kZVariant']
  Default Unihan fields.
- unihan_tabular.process.UNIHAN_FILES = [u'Unihan_RadicalStrokeCounts.txt', u'Unihan_NumericValues.txt', u'Unihan_Variants.txt', u'Unihan_DictionaryIndices.txt', u'Unihan_DictionaryLikeData.txt', u'Unihan_OtherMappings.txt', u'Unihan_Readings.txt', u'Unihan_IRGSources.txt']
  Default Unihan files.
- unihan_tabular.process.UNIHAN_URL = u'http://www.unicode.org/Public/UNIDATA/Unihan.zip'
  URI of Unihan.zip data.
- unihan_tabular.process.UNIHAN_ZIP_PATH = u'/home/docs/.cache/unihan_tabular/downloads/Unihan.zip'
  File path to download the zip file to.
- unihan_tabular.process.WORK_DIR = u'/home/docs/.cache/unihan_tabular/downloads'
  Directory to use for processing intermediate files.
- unihan_tabular.process.download(url, dest, urlretrieve_fn=<function urlretrieve>, reporthook=None)
  Download a file to a destination.
  Returns: destination where the file was downloaded to.
- unihan_tabular.process.extract_zip(zip_path, dest_dir)
  Extract zip file. Return zipfile.ZipFile instance.
  Returns: The extracted zip.
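A minimal sketch of the extract-and-return pattern described above, using only the standard library. The name `extract_zip_sketch` and the sample archive are hypothetical; the library's own implementation may differ.

```python
import os
import tempfile
import zipfile


def extract_zip_sketch(zip_path, dest_dir):
    """Extract every member of the archive into dest_dir and return the ZipFile."""
    archive = zipfile.ZipFile(zip_path)
    archive.extractall(dest_dir)
    return archive


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in archive so the sketch runs without a download.
    zip_path = os.path.join(tmp, "sample.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("Unihan_Readings.txt", "# placeholder\n")

    out_dir = os.path.join(tmp, "out")
    archive = extract_zip_sketch(zip_path, out_dir)
    extracted_names = archive.namelist()
    extracted_ok = os.path.isfile(os.path.join(out_dir, "Unihan_Readings.txt"))
    archive.close()
```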
- unihan_tabular.process.files_exist(path, files)
  Return True if all files exist in specified path.
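The check can be sketched as follows. `files_exist_sketch` is a hypothetical stand-in written for illustration, assuming each name in `files` is joined onto `path` before testing.

```python
import os
import tempfile


def files_exist_sketch(path, files):
    """Return True only if every name in `files` exists under `path`."""
    return all(os.path.exists(os.path.join(path, name)) for name in files)


with tempfile.TemporaryDirectory() as tmp:
    # Create one of the two files, then probe for one and for both.
    open(os.path.join(tmp, "Unihan_Readings.txt"), "w").close()
    found = files_exist_sketch(tmp, ["Unihan_Readings.txt"])
    missing = files_exist_sketch(
        tmp, ["Unihan_Readings.txt", "Unihan_Variants.txt"]
    )
```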
- unihan_tabular.process.filter_manifest(files)
  Return filtered UNIHAN_MANIFEST from a list of file names.
- unihan_tabular.process.get_fields(d)
  Return list of fields from dict of {filename: ['field', 'field1']}.
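As an illustration, flattening such a dict might look like the sketch below. The name `get_fields_sketch` is hypothetical, and the de-duplication and sorting are assumptions, not documented behaviour.

```python
def get_fields_sketch(d):
    """Flatten {filename: [field, ...]} into one sorted list of unique fields."""
    # Assumption: duplicates are dropped and the result is sorted.
    return sorted({field for fields in d.values() for field in fields})


manifest = {
    "Unihan_Readings.txt": ["kMandarin", "kCantonese"],
    "Unihan_Variants.txt": ["kZVariant"],
}
fields = get_fields_sketch(manifest)
```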
- unihan_tabular.process.get_parser()
  Return argparse.ArgumentParser instance for CLI.
  Returns: argument parser for CLI use.
  Return type: argparse.ArgumentParser
- unihan_tabular.process.has_valid_zip(zip_path)
  Return True if valid zip exists.
  Parameters: zip_path (str) – absolute path to zip
  Returns: True if valid zip exists at path
  Return type: bool
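One way to perform this kind of validity check with the standard library is `zipfile.is_zipfile`. The sketch below is hypothetical (`has_valid_zip_sketch` is not the library function) and assumes "valid" means the path exists and is recognised as a zip archive.

```python
import os
import tempfile
import zipfile


def has_valid_zip_sketch(zip_path):
    """Return True if zip_path is an existing file that looks like a zip archive."""
    return os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.zip")
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("a.txt", "data")

    bad = os.path.join(tmp, "bad.zip")
    with open(bad, "w") as fh:
        fh.write("this is not a zip archive")

    good_result = has_valid_zip_sketch(good)
    bad_result = has_valid_zip_sketch(bad)
    absent_result = has_valid_zip_sketch(os.path.join(tmp, "absent.zip"))
```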
- unihan_tabular.process.in_fields(c, fields)
  Return True if string is in the default fields.
- unihan_tabular.process.listify(data, fields)
  Convert tabularized data to a CSV-friendly list.
  Parameters:
  - data (list) – list of dicts
  - fields – keys/columns, e.g. ['kDictionary']
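To make "CSV-friendly" concrete, here is a rough sketch that turns a list of dicts into a list of rows. `listify_sketch` is hypothetical; the header row and the empty-string filler for missing keys are assumptions made for the example.

```python
def listify_sketch(data, fields):
    """Turn a list of dicts into rows ordered by `fields`, header row first."""
    rows = [list(fields)]  # assumption: the first row holds the column names
    for record in data:
        # Assumption: a missing key becomes an empty cell.
        rows.append([record.get(field, "") for field in fields])
    return rows


data = [
    {"ucn": "U+4E00", "char": "\u4e00", "kDefinition": "one"},
    {"ucn": "U+4E8C", "char": "\u4e8c"},
]
rows = listify_sketch(data, ["ucn", "char", "kDefinition"])
```

Rows in this shape can be passed directly to `csv.writer(...).writerows(rows)`.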
- unihan_tabular.process.load_data(files)
  Extract zip and process information into CSVs.
  Parameters: files (list)
  Returns: string of combined data from files
  Return type: str
- unihan_tabular.process.normalize(raw_data, fields)
  Return normalized data from UNIHAN data files.
  Returns: list of unihan character information
- unihan_tabular.process.not_junk(line)
  Return False on newlines and C-style comments.
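A small sketch of such a line filter. The Unihan text files mark comment lines with a leading `#`, so this example assumes that is the comment form being filtered; `not_junk_sketch` is a hypothetical name, not the library function.

```python
def not_junk_sketch(line):
    """Return False for bare newlines and for lines starting with '#'."""
    # Assumption: comments are the '#'-prefixed lines found in Unihan files.
    return line != "\n" and not line.startswith("#")


sample = [
    "# Unihan_Readings.txt\n",
    "\n",
    "U+4E00\tkMandarin\ty\u012b\n",
]
kept = [line for line in sample if not_junk_sketch(line)]
```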
- unihan_tabular.process.zip_has_files(files, zip_file)
  Return True if zip has the files inside.
  Parameters:
  - files (list) – list of files inside zip
  - zip_file (zipfile.ZipFile) – zip file to look inside.
  Returns: True if the files are listed by zipfile.ZipFile.namelist().
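The membership test against `zipfile.ZipFile.namelist()` can be sketched with a set comparison. `zip_has_files_sketch` is hypothetical, and the in-memory archive exists only to make the example runnable.

```python
import io
import zipfile


def zip_has_files_sketch(files, zip_file):
    """Return True if every name in `files` appears in zip_file.namelist()."""
    return set(files).issubset(zip_file.namelist())


# Build a small archive in memory with two members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Unihan_Readings.txt", "")
    zf.writestr("Unihan_Variants.txt", "")

with zipfile.ZipFile(buffer) as zf:
    has_both = zip_has_files_sketch(
        ["Unihan_Readings.txt", "Unihan_Variants.txt"], zf
    )
    has_missing = zip_has_files_sketch(["Unihan_IRGSources.txt"], zf)
```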
Utility and helper methods for the script.
util
- unihan_tabular.util.ucn_to_unicode(ucn)
  Return a python unicode value from a UCN.
  Converts a Unicode Universal Character Number (e.g. "U+4E00" or "4E00") to Python unicode (u'\u4e00').
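The conversion amounts to parsing the hexadecimal code point and mapping it to a character. A Python 3 sketch follows; `ucn_to_unicode_sketch` is a hypothetical name, and the library itself may handle more input forms.

```python
def ucn_to_unicode_sketch(ucn):
    """Convert 'U+4E00' or '4E00' to the corresponding one-character string."""
    ucn = ucn.strip()
    if ucn.upper().startswith("U+"):
        ucn = ucn[2:]  # drop the 'U+' prefix, leaving the hex digits
    return chr(int(ucn, 16))


with_prefix = ucn_to_unicode_sketch("U+4E00")
without_prefix = ucn_to_unicode_sketch("4E00")
```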
- unihan_tabular.util.ucnstring_to_python(ucn_string)
  Return a string with Unicode UCNs (e.g. "U+4E00") converted to native Python Unicode (u'\u4e00').
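For a string that contains several UCNs, one possible approach is a regular-expression substitution. This is a sketch under assumptions: `ucnstring_sketch` and its pattern (4 to 6 hex digits after `U+`) are hypothetical and not taken from the library.

```python
import re

# Assumption: a UCN is 'U+' followed by 4 to 6 hexadecimal digits.
UCN_PATTERN = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def ucnstring_sketch(ucn_string):
    """Replace each 'U+XXXX' occurrence with the character it names."""
    return UCN_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), ucn_string)


converted = ucnstring_sketch("U+4E00 U+4E8C")
untouched = ucnstring_sketch("no code points here")
```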
- unihan_tabular.util.ucnstring_to_unicode(ucn_string)
  Return ucnstring as Unicode.
Test helper functions for downloading and processing Unihan data.