TransWikia.com

Open Software Two-Line Add Pinyin To Characters

Chinese Language Asked by Mo. on October 23, 2020

Is there any open software that adds a second line of pinyin to a line of characters?

i.e.: converting “你好” to:

nǐhǎo

你好

Converting to ni3hao3 would be fine too.

Open, namely, so that I can tinker with the pinyin to fit my needs.

4 Answers

You probably already realize that [hanzi --> pinyin] is not a straightforward function, because many characters have more than one pronunciation (多音字).

Character-by-character rendering is the simplest approach, but it will leave you with many wrong pinyin syllables and/or multiple options. This site gives all the different pinyin readings of a character in brackets, e.g.

INPUT: 你好
OUTPUT: nǐ[hǎo hāo hào]

So any decent approach would require a lookup in a multi-character word dictionary. If there is nothing online that comes close to this, you could download the latest CEDICT file, remove all but the Traditional, Simplified and Pinyin fields, and use Python, awk or grep to write a script that looks up Chinese words in bulk and writes the results to a file or similar. It would not be that complex, although you would still need an 'algorithm' to identify word boundaries; otherwise just use 'brute force' to look up all the matching entries in CEDICT.
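As a rough illustration of the dictionary lookup described above, here is a minimal Python sketch. It assumes CEDICT's usual line format (`Traditional Simplified [pin1 yin1] /gloss/`) and uses greedy longest-match as a stand-in for a real word-boundary algorithm. The sample entries and the function names (`load_cedict`, `annotate`, `two_lines`) are made up for the example, and taking only the first reading of each headword is a simplification.

```python
import re

# A few sample lines in the assumed CEDICT format (not the real file):
#   Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
SAMPLE_CEDICT = """\
# comment lines start with '#'
你好 你好 [ni3 hao3] /hello/hi/
音樂 音乐 [yin1 yue4] /music/
你 你 [ni3] /you (informal)/
好 好 [hao3] /good/well/
樂 乐 [le4] /happy/cheerful/
"""

LINE_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /")


def load_cedict(text):
    """Map each simplified headword to the pinyin of its first entry."""
    table = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            _trad, simp, pinyin = match.groups()
            # Keep only the first reading seen for a headword (a simplification).
            table.setdefault(simp, pinyin.lower())
    return table


def annotate(text, table):
    """Greedy longest-match lookup; unknown characters pass through unchanged."""
    max_len = max(map(len, table), default=1)
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                result.append((chunk, table[chunk]))
                i += n
                break
        else:
            result.append((text[i], text[i]))
            i += 1
    return result


def two_lines(text, table):
    """Return the pinyin line followed by the original character line."""
    pinyin_line = " ".join(p for _, p in annotate(text, table))
    return pinyin_line + "\n" + text


if __name__ == "__main__":
    table = load_cedict(SAMPLE_CEDICT)
    print(two_lines("你好音乐", table))
```

Because 音乐 is matched as a whole word, 乐 comes out as yue4 rather than le4, which is exactly the case that a character-by-character lookup gets wrong. Greedy matching will still mis-segment some sentences, so the output needs checking.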

What is the application context or purpose of what you would like to do?

(I am adding this as an answer because it's too long for a comment and also because of the URLs)


UPDATE

Some users have recommended the pypinyin module which claims to support 多音字. I have run a test of pypinyin against the CEDICT dictionary file, which is human compiled by many users and is the largest open source Chinese dictionary and is used by many apps (including Pleco).

Test results: The CEDICT file contains over 113k entries (compiled over many years and continuously updated). For each entry in the CEDICT file I have compared the pinyin transcription against the output for the Simplified character expression of the pypinyin module.

I have accounted for:

  • differences in pinyin transcription, like 'ü' vs. 'v'
  • transcription of ('er' vs. 'r') in erhua expressions
  • upper- and lowercase pinyin transcriptions
  • some obvious errors in the CEDICT file (missing pinyin) or whitespace and · (middle dot) and comma usage
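The adjustments listed above could look roughly like the following Python sketch. This is not the script actually used for the test, only an illustration under assumptions: both sides are in tone-number pinyin, CEDICT writes ü as "u:", and an erhua syllable appears as a bare "r" plus an optional tone digit. The names `normalize` and `mismatches` are made up for the example.

```python
import re
import unicodedata


def normalize(pinyin):
    """Reduce a tone-number pinyin string to a key suitable for comparison."""
    s = unicodedata.normalize("NFC", pinyin).lower()   # ignore upper/lowercase
    s = s.replace("u:", "ü").replace("v", "ü")         # 'u:' / 'v' both mean 'ü'
    s = s.replace("·", " ").replace(",", " ")          # middle dot and comma
    s = re.sub(r"\br(\d?)\b", r"er\1", s)              # bare erhua 'r' -> 'er'
    return " ".join(s.split())                         # collapse whitespace


def mismatches(pairs):
    """Return the (reference, candidate) pairs that still differ after normalizing."""
    return [(a, b) for a, b in pairs if normalize(a) != normalize(b)]


if __name__ == "__main__":
    sample = [
        ("Lu:4 shi1", "lv4  shi1"),      # differs only in notation
        ("yin1 yue4", "yin1 le4"),       # a genuine mismatch
    ]
    print(mismatches(sample))
```

Only differences that survive this normalization count as mismatches, so notation-only differences such as 'lu:4' vs. 'lv4' are not held against either side.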

To be fair, I have also excluded all the single character CEDICT entries from the test, because it's impossible to give the correct pinyin for a 多音字 without any context.

In the first round, after excluding the single-character entries, I ran the test on 100,169 entries and got 9471 mismatching pairs (CEDICT <--> pypinyin).

Next, I analysed the results and realized that pypinyin has a somewhat contradictory policy on tone sandhi: it accounts for the tone change of 不 and 一 (e.g. in 不在) but not for 3rd tone + 3rd tone words, like 小姐. So I decided to exclude entries containing 不 and 一 in order to drill deeper (which also excluded some cases where pypinyin was incorrect for a reason other than tone sandhi notation). In the case of 不 and 一, marking the tone change is a matter of opinion (but it should be done consistently, not as in pypinyin).

In the 2nd round, I had 97,196 entries to test and got 8968 mismatching pairs. After scrolling through the list of mismatching results, I came to the conclusion that pypinyin has no support for the neutral tone (5), e.g. 什么 (shén me) is *shén mé for pypinyin. Final 子, as in 位子, is given as 3rd tone. Similarly, 儿 is always incorrectly er2, etc.

The other most common error is incorrect tone, other than those neutral syllables. These include (among others):

  • 们 *men2
  • 发 in 黑发 *hei1 fa1
  • 禁伐 is incorrectly *jin1 fa2
  • 近几年 *jin4 ji1 nian2, etc.

Incorrect initials make up another large group:

  • 调 tiao vs. diao
  • 朝 chao vs. zhao
  • 重 chong vs. zhong,
  • 传 chuan vs. zhuan, etc.
  • 弹琴 *dan qin
  • 厦门 *sha men
  • 社长 *she chang, etc.

Also many elementary words are given with incorrect pinyin:

  • 音乐 is *yin le in pypinyin (my "favorite")
  • 鱼露 is incorrectly *yu luo
  • 卡 as *qia in almost all cases, like 卡拉OK *qia la OK
  • 柏林 as *bai lin
  • 落下 *la xia

I have noticed in some cases that the Taiwan Mandarin pronunciation is favored by pypinyin, but the doc pages do not comment on this, which is obviously another issue, since users do have to know what to expect as output.

With all this, pypinyin has obvious shortcomings that make it unsuitable for professional use (you have to manually check every word to eliminate hard-to-spot errors) and also for building dictionaries or other material targeted at beginners. Even HSK level 2 or 3 expects most of the examples noted above to be used correctly.


UPDATE2:

I am considering creating a Python lib that uses the CEDICT file to transcribe Chinese characters to pinyin. My only concern is that the CEDICT file also contains inaccurate pinyin, so first I would need a reliable source for correcting those entries in CEDICT. If anyone knows about such a file or API, please leave a comment.

Correct answer by imrek on October 23, 2020

For chrome there are a few extensions that can add pinyin to the current web page:

  1. Add Pinyin
  2. Simplified-Traditional-Converter (includes pinyin) github repo (replaces the original text with pinyin rather than appending). A while ago I forked DoctorLai's repo and made it work in append mode rather than replace but the fork is more of an experiment rather than ready for use.

Answered by ccpizza on October 23, 2020

Python's pypinyin module does what you want. Check it out at https://pypi.python.org/pypi/pypinyin.

I highly recommend taking Python's antigravity module for a spin as well, it's a real trip.

Answered by Master Sparkles on October 23, 2020

There are many tools that can do this job: commercial products, open source projects, free tools and online services.

I don't know what you are going to do with it, so it's hard to pick one for you.

Here are some that are easy to use: online service http://pth.linqi.org/pyzd_biaozhu.html

Chinese font with pinyin: http://www.pinyinok.com/pyhzk.htm

free tool: http://www.putonghuaworld.com/computer/100604/100604_0101.htm

pro open source tool list: http://www.oschina.net/project/tag/446/pinyin

BTW: You can find this feature in the appropriate edition of Microsoft Word and Adobe InDesign.

Answered by wolfrevo on October 23, 2020
