# Algorithm for finding names in Chinese texts

Chinese Language Asked by Andrey Epifantsev on October 13, 2020

I am writing an application for extracting information from Chinese texts. One of the tasks is finding names (personal, geographical or something else). The algorithm does not need to find 100% of the names; 50% is enough.

In European languages (for example English or Russian), I can detect names by their initial capital letters: if a word in the middle of a sentence begins with a capital letter, then that word is a name. This criterion is not 100% reliable and does not find all names, but it is enough for my purpose.
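For reference, the capital-letter heuristic described above could look roughly like the sketch below. The function name and the naive sentence splitting are assumptions made for illustration, not part of any existing application.

```python
import re

def capitalized_mid_sentence_words(text):
    """Return words that start with a capital letter but are not sentence-initial."""
    names = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # [^\W\d_]+ matches runs of letters in any script (Latin, Cyrillic, ...).
        words = re.findall(r"[^\W\d_]+", sentence)
        # Skip the first word: it is capitalized regardless of whether it is a name.
        for word in words[1:]:
            if word[0].isupper():
                names.append(word)
    return names

print(capitalized_mid_sentence_words("Yesterday Anna flew to Moscow. She met Ivan there."))
# ['Anna', 'Moscow', 'Ivan']
```

As expected from the description, this misses sentence-initial names and produces false positives such as the English pronoun "I". Chinese has no letter case, so nothing comparable is available there.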

I know that sometimes (but not always) a name follows 叫. But I do not know how long that name is.

Could you tell me some signs (criteria, features) by which the algorithm can find names in Chinese texts?

For example, in the sentence below, are 李三 and 张三 both names of people?

A deep-learning-based algorithm is your best bet, for example:

https://zhuanlan.zhihu.com/p/61227299

https://github.com/wainshine/Chinese-Names-Corpus

Answered by Siyi Deng on October 13, 2020

Depending on your use case, you could consider an approach like the following:

1. Segment the text into words using some tried-and-trained machine learning algorithm (this step alone will already yield imperfect results).

2. Look up each word in the free and downloadable CC-Cedict dictionary.

3. Check if the Pinyin field for that word's CC-Cedict entry begins with a capital letter.

4. If yes, judge that the word is a proper noun. Else, judge that it isn't.

If you want some idea of how accurate such an approach would be, try playing around with the regex-based query here (disclaimer - shameless plug for my app; instructions available by clicking the ℹ icon).
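The dictionary-lookup steps (2 to 4) can be sketched as follows. This is a minimal sketch, assuming the text has already been segmented into words (step 1) and using a tiny inline sample in place of the real CC-CEDICT file; the sample entries and the function names are illustrative assumptions.

```python
import re

# CC-CEDICT line format: TRADITIONAL SIMPLIFIED [pin1 yin1] /gloss 1/gloss 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")

# Tiny inline sample standing in for the downloaded dictionary (illustrative entries).
SAMPLE_CEDICT = """\
# comment lines start with '#'
美國 美国 [Mei3 guo2] /United States/USA/US/
北京 北京 [Bei3 jing1] /Beijing, capital of the People's Republic of China/
在 在 [zai4] /(located) at/(to be) in/to exist/
朋友 朋友 [peng2 you5] /friend/
"""

def load_proper_nouns(cedict_text):
    """Collect every headword that has at least one entry with capitalized Pinyin."""
    proper = set()
    for line in cedict_text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = CEDICT_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the expected format
        traditional, simplified, pinyin, _glosses = match.groups()
        if pinyin[:1].isupper():
            proper.update((traditional, simplified))
    return proper

def find_proper_nouns(words, proper):
    """Keep only the segmented words judged to be proper nouns."""
    return [word for word in words if word in proper]

PROPER = load_proper_nouns(SAMPLE_CEDICT)
# The word list is assumed to come from a separate segmentation step.
print(find_proper_nouns(["我", "朋友", "在", "美国"], PROPER))
# ['美国']
```

A word may have several dictionary entries, only some of them capitalized (a character can be both a surname and an ordinary word). This sketch treats any capitalized entry as a hit, which favours recall over precision.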

[Edited - previous answer below. Above approach is simpler, probably more reliable, and doesn't require sending tons of network requests]

1. For each word (or string of isolated characters that have been judged not to form a word), use the Wikipedia API to check if a page exists for it, using a query something like this: action=query&format=json&prop=categories&titles=...&formatversion=2
2. If yes, check if one or more of the categories of said page match some variation of the following regex:
/(?:人|者|地名|城市)$/

3. If yes, judge that the word is a proper noun. Else, judge that it isn't.
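A rough sketch of this Wikipedia-based check is below. The endpoint (Chinese Wikipedia), the User-Agent string, the function names, and the sample response are all assumptions for illustration; the query parameters are the ones quoted in step 1. The parsing is kept separate from the network call so it can be tried on a hand-written response.

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://zh.wikipedia.org/w/api.php"  # assumption: Chinese Wikipedia
CATEGORY_PATTERN = re.compile(r"(?:人|者|地名|城市)$")

def categories_from_response(data):
    """Return category titles for the first page, or None if the page is missing."""
    pages = data.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing"):
        return None
    return [cat["title"] for cat in pages[0].get("categories", [])]

def looks_like_proper_noun(categories):
    """True if any category title ends in one of the suffixes from the regex."""
    if not categories:
        return False
    return any(CATEGORY_PATTERN.search(title) for title in categories)

def fetch_categories(word):
    """Query the API for one title (performs a network request)."""
    params = urllib.parse.urlencode({
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": word,
        "formatversion": "2",
    })
    request = urllib.request.Request(
        API_URL + "?" + params,
        headers={"User-Agent": "name-finder-sketch/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return categories_from_response(json.load(response))

# Hand-written response in the formatversion=2 shape (the category title is made up).
sample = {"query": {"pages": [
    {"title": "李白", "categories": [{"ns": 14, "title": "Category:唐朝诗人"}]}
]}}
print(looks_like_proper_noun(categories_from_response(sample)))  # True
```

In real use, `looks_like_proper_noun(fetch_categories(word))` would be called once per candidate word, which is where the large number of network requests comes from.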

Either approach will be a little messy and not entirely reliable, but if you're lucky you might hit at least the 50% accuracy you're hoping for.

Answered by Lionel Rowe on October 13, 2020

It seems to be an impossible task. Take, for example, the sentence 加利福尼亚在美国。 There are two names in it: the first has five characters and the second has two. Unless you have come across these names before, you would not know. Names after 叫 can only be a tiny fraction of all the places where a name may appear.

Answered by Xuehong Zhang on October 13, 2020

This doesn't work with all texts, but it should catch 100% of the names in texts that support it.

You can search the text for proper name marks:

In Chinese writing, a proper name mark (Simplified Chinese: 专名号, zhuānmínghào; Traditional Chinese: 專名號) is an underline used to mark proper names, such as the names of people, places, dynasties, organizations. The related book name mark (Simplified Chinese: 书名号, shūmínghào; Traditional Chinese: 書名號) indicated by a wavy underline (﹏﹏) is used to mark the titles of publications or texts.

For example:

Qu Yuan was exiled, and thus composed the Li Sao. Zuo Qiu (or Zuoqiu) lost his sight, hence there is the Guo Yu. (Sima Qian, Letter to Ren An)

(Underline doesn't seem to be supported, check the Wiki link for the better example.)

The problem is that:

The proper name mark is rarely used in modern Chinese publications, and the Guillemet (《 》or〈 〉) is more commonly used to indicate titles. It is occasionally used in Taiwan and Hong Kong in school textbooks. However, in scholarly editions of classical Chinese texts, especially vertically typeset texts (where they appear to the left of the text instead of underneath), use of both the proper name mark and the book name mark is common, as they help readers avoid misinterpretations of the text.
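Since modern texts usually mark titles with 《 》 or 〈 〉, that one class of names can be pulled out with a simple pattern. The sketch below is only an illustration: the function name is made up, and the sample sentence is the Sima Qian line quoted above, written with modern title marks.

```python
import re

# Matches text inside 《…》 or 〈…〉 (no nesting handled).
TITLE_PATTERN = re.compile(r"《([^《》]+)》|〈([^〈〉]+)〉")

def extract_titles(text):
    """Return the titles enclosed in double or single title marks, in order."""
    return [double or single for double, single in TITLE_PATTERN.findall(text)]

print(extract_titles("屈原放逐，乃赋《离骚》；左丘失明，厥有《国语》。"))
# ['离骚', '国语']
```

This only finds titles of works, not names of people or places, so it complements rather than replaces the other approaches.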

Answered by Mo. on October 13, 2020
