
A word list is a list of words in a lexicon, generally sorted by frequency of occurrence (either by graded levels, or as a ranked list). A word list is compiled by lexical frequency analysis within a given text corpus, and is used in corpus linguistics to investigate the genealogies and evolution of languages and texts. A word which appears only once in the corpus is called a hapax legomenon. In pedagogy, word lists are used in curriculum design for vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort" (Nation 1997), but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". Word counting is a thousand years old, and very large analyses were still done by hand in the mid-20th century; electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has since accelerated the research field.

In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

Table 1: Example lexical frequency analysis
Type Occurrences Rank
the 3,789,654 1st
he 2,098,762 2nd
[...]
king 57,897 1,356th
boy 56,975 1,357th
[...]
stringyfy 5 34,589th
[...]
transducionalify 1 123,567th
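
The counting and ranking behind a table such as Table 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `frequency_list` and the sample sentence are made up for the example, and a "word" is crudely taken to be a lowercased run of letters and apostrophes.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (rank, word, occurrences) tuples, most frequent first."""
    # Simplifying assumption: a word is a run of ASCII letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # The rank is simply the position in the sorted list, starting at 1.
    return [(rank, word, n) for rank, (word, n) in enumerate(ordered, start=1)]

sample = "the king and the boy saw the king"
for rank, word, n in frequency_list(sample):
    print(rank, word, n)
```

Tied words receive consecutive ranks here; a real study may prefer to give tied words the same rank.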

Methodology

Factors

Nation (1997) noted how much computing capabilities have helped, making corpus analysis much easier. He cited several key issues which influence the construction of frequency lists:

  • corpus representativeness
  • word frequency and range
  • treatment of word families
  • treatment of idioms and fixed expressions
  • range of information
  • various other criteria

Corpora

Traditional written corpus

[Figure: Frequency of personal pronouns in Serbo-Croatian]

Most currently available studies are based on written text corpora, which are more easily available and easier to process.

SUBTLEX movement

However, New et al. 2007 proposed tapping into the large number of subtitles available online to analyse large amounts of speech. Brysbaert & New 2009 made a long critical evaluation of the traditional textual analysis approach, and supported a move toward speech analysis and the analysis of film subtitles available online. The initial research was followed by a handful of studies,[1] providing valuable frequency count analyses for various languages. In-depth SUBTLEX studies over cleaned-up open subtitles were produced for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al. 2010), Vietnamese (Pham, Bolger & Baayen 2011), Brazilian Portuguese (Tang 2012) and European Portuguese (Soares et al. 2015), Albanian (Avdyli & Cuetos 2013), Polish (Mandera et al. 2014), Catalan (2019[2]) and Welsh (van Heuven et al. 2024[3]). SUBTLEX-IT (2015) provides raw data only.[4]

Lexical unit

In any case, the basic "word" unit should be defined. For Latin scripts, words are usually one or several characters separated either by spaces or punctuation. But exceptions can arise: English "can't" and French "aujourd'hui" include punctuation, while French "château d'eau" designates a concept different from the simple sum of its components even though it includes a space. It may also be preferable to group the words of a word family under the representation of its base word. Thus, possible, impossible and possibility are words of the same word family, represented by the base word *possib*. For statistical purposes, all these words are summed up under the base word form *possib*, allowing the ranking of a concept and form occurrence. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given string of several characters can be interpreted either as a phrase of single-character words or as a multi-character word.
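
As a rough illustration of these choices, the sketch below keeps apostrophes inside a word and sums the members of a word family under one base form. The names `WORD_RE`, `FAMILIES`, `tokenize` and `family_counts` are hypothetical, and the family table is hand-made for the example; it does not handle multi-word units such as "château d'eau" or unsegmented scripts such as Chinese.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters, optionally joined by apostrophes,
# so that "can't" and "aujourd'hui" stay in one piece.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

# Hypothetical hand-made family table mapping each word to its base form.
FAMILIES = {
    "possible": "possib",
    "impossible": "possib",
    "possibility": "possib",
}

def tokenize(text):
    """Split text into lowercased word tokens."""
    return WORD_RE.findall(text.lower())

def family_counts(text):
    """Count occurrences, summing family members under their base form."""
    return Counter(FAMILIES.get(tok, tok) for tok in tokenize(text))

print(tokenize("I can't come aujourd'hui"))
print(family_counts("possible, impossible: possibility")["possib"])
```

Words absent from the family table are counted under their own surface form, so the sketch degrades to a plain word count when no family information is available.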

Statistics

It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
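
Zipf's law states that a word's frequency is roughly inversely proportional to its rank, so a plot of log frequency against log rank should be close to a straight line with slope near -1. One simple way to check a frequency list, sketched here with a hypothetical helper `zipf_slope`, is to fit that slope by least squares. This is an illustrative check, not a rigorous statistical test.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic, perfectly Zipfian counts: frequency = 1000 / rank.
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_slope(ideal), 3))
```

On real corpora the fitted slope only approximates -1, and the fit is usually poorest for the very rarest words.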

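German linguists define the Häufigkeitsklasse (frequency class) N of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. For example, against a most frequent item with 3,789,654 occurrences (as for "the" in Table 1), a misspelled word such as outragious with 76 occurrences has a ratio of 76/3789654 and belongs in class 16.

$$N = \left\lfloor 0.5 - \log_2\left(\frac{\text{frequency of this item}}{\text{frequency of the most common item}}\right) \right\rfloor$$

where $\lfloor \ldots \rfloor$ is the floor function.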
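
The frequency class can be computed directly from the two counts. The following is a minimal sketch; the function name `frequency_class` is chosen here for illustration.

```python
import math

def frequency_class(item_freq, top_freq):
    """Frequency class: floor(0.5 - log2(item_freq / top_freq))."""
    return math.floor(0.5 - math.log2(item_freq / top_freq))

print(frequency_class(3789654, 3789654))  # most common item
print(frequency_class(76, 3789654))       # the ratio 76/3789654 from the text
```

The two calls return 0 and 16 respectively, matching the values given in the text.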

Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.
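
The idea can be sketched as follows, under loud assumptions: the `HYPERNYMS` table is a tiny hypothetical stand-in for a semantic network, the counts are invented, and `compress` is an illustrative name rather than an established algorithm or API.

```python
from collections import Counter

# Hypothetical hypernym table standing in for a semantic network.
# It must not contain cycles, or the loop below would never end.
HYPERNYMS = {"spaniel": "dog", "terrier": "dog", "dog": "animal"}

def compress(tokens, counts, min_count):
    """Replace rare terms by their hypernym until a common term is reached."""
    out = []
    for tok in tokens:
        # Climb the hierarchy while the term is rare and a hypernym is known.
        while counts[tok] < min_count and tok in HYPERNYMS:
            tok = HYPERNYMS[tok]
        out.append(tok)
    return out

# Invented frequency list for the example.
counts = Counter({"the": 900, "animal": 200, "dog": 50, "spaniel": 2, "terrier": 1})
print(compress(["the", "spaniel", "saw", "the", "terrier"], counts, min_count=10))
```

Rare words with no known hypernym, such as "saw" in this toy example, are left unchanged.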

Pedagogy

These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006).

Effects of word frequency

Word frequency is known to have various effects (Brysbaert et al. 2011; Rudell 1993). Memorization is positively affected by higher word frequency, likely because the learner is exposed to the word more often (Laufer 1997). Lexical access is positively influenced by high word frequency, a phenomenon called the word frequency effect (Segui et al.). The effect of word frequency is related to the effect of age of acquisition, the age at which the word was learned.

Languages

Below is a review of available resources.

English

Word counting is an ancient field,[5] with known discussion dating back to Hellenistic times. In 1944, Edward Thorndike, Irvin Lorge and colleagues[6] hand-counted 18,000,000 running words to provide the first large-scale English language frequency list, before modern computers made such projects far easier (Nation 1997). All 20th-century works suffer from their age. In particular, they miss words relating to technology: "blog", which in 2014 ranked #7665 in frequency[7] in the Corpus of Contemporary American English,[8] was first attested in 1999,[9][10][11] and does not appear in any of these three lists.

The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)

The Teacher's Word Book contains 30,000 lemmas, or about 13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age, and language changes, have reduced its applicability (Nation 1997).

The General Service List (West, 1953)

The General Service List contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rates of occurrence (%) for the different meanings and parts of speech of each headword are provided. Various criteria, other than frequency and range, were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise (Nation 1997). This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.

The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)

This list is based on a corpus of 5 million running words, from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials, and in its tagging of words by their frequency in each school grade and in each subject area (Nation 1997).

The Brown (Francis and Kucera, 1982), LOB and related corpora

These now contain 1 million words from a written corpus representing different dialects of English. These sources are used to produce frequency lists (Nation 1997).

French

Traditional datasets

A review has been made by New & Pallier. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list of 1,500 high-frequency words, completed by a later F.F.2 list of 1,700 mid-frequency words, and the most used syntax rules.[12] It is claimed that 70 grammatical words constitute 50% of communicative sentences,[13][14] while 3,680 words give about 95–98% coverage.[15] A list of 3,000 frequent words is available.[16]
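
Coverage figures of this kind are the share of running words accounted for by the k most frequent entries of a frequency list. A minimal sketch of that calculation follows; the function name `coverage` and the toy counts are invented for illustration and are not taken from any of the lists above.

```python
from collections import Counter

def coverage(counts, k):
    """Share of running words accounted for by the k most frequent types."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total

# Invented toy counts: 100 running words over 4 word types.
counts = Counter({"le": 50, "de": 30, "chat": 15, "ornithorynque": 5})
print(coverage(counts, 2))
```

In this toy list the two most frequent types cover 80 of the 100 running words.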

The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet.[17] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".[18]

More recently, the Lexique3 project provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under the open license CC-by-sa-4.0.[19]

Subtlex

Lexique3 is an ongoing study from which the SUBTLEX movement cited above originated. New et al. 2007 made a completely new count based on online film subtitles.

Spanish

There have been several studies of Spanish word frequency (Cuetos et al. 2011).[20]

Chinese

Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency (Allanic 2003). The American sinologist John DeFrancis mentioned its importance for learning and teaching Chinese as a foreign language in Why Johnny Can't Read Chinese (DeFrancis 1966). As frequency toolkits, Da (Da 1998) and the Taiwanese Ministry of Education (TME 1997) provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words, are two other lists displaying common Chinese words and characters. Following the SUBTLEX movement, Cai & Brysbaert 2010 made a rich study of Chinese word and character frequencies.

Other

Wiktionary contains frequency lists in more languages.[21]

Lists of the most frequently used words in different languages, based on Wikipedia or combined corpora, are also available.[22]

Notes

  1. ^ "Crr » Subtitle Word Frequencies".
  2. ^ Boada, Roger; Guasch, Marc; Haro, Juan; Demestre, Josep; Ferré, Pilar (1 February 2020). "SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan". Behavior Research Methods. 52 (1): 360–375. doi:10.3758/s13428-019-01233-1. ISSN 1554-3528. PMID 30895456. S2CID 84843788.
  3. ^ van Heuven, Walter JB; Payne, Joshua S; Jones, Manon W (May 2024). "SUBTLEX-CY: A new word frequency database for Welsh". Quarterly Journal of Experimental Psychology. 77 (5): 1052–1067. doi:10.1177/17470218231190315. ISSN 1747-0218. PMC 11032624. PMID 37649366.
  4. ^ Amenta, Simona; Mandera, Paweł; Keuleers, Emmanuel; Brysbaert, Marc; Crepaldi, Davide (7 January 2022). "SUBTLEX-IT".
  5. ^ Bontrager, Terry (1 April 1991). "The Development of Word Frequency Lists Prior to the 1944 Thorndike-Lorge List". Reading Psychology. 12 (2): 91–116. doi:10.1080/0270271910120201. ISSN 0270-2711.
  6. ^ The teacher's word book of 30,000 words.
  7. ^ "Words and phrases: Frequency, genres, collocates, concordances, synonyms, and WordNet".
  8. ^ "Corpus of Contemporary American English (COCA)".
  9. ^ "It's the links, stupid". The Economist. 20 April 2006. Retrieved 2025-08-07.
  10. ^ Merholz, Peter (1999). "Peterme.com". Internet Archive. Archived from the original on 2025-08-07. Retrieved 2025-08-07.
  11. ^ Kottke, Jason (26 August 2003). "kottke.org". Retrieved 2025-08-07.
  12. ^ "Le français fondamental". Archived from the original on 2025-08-07.
  13. ^ Ouzoulias, André (2004), Comprendre et aider les enfants en difficulté scolaire: Le Vocabulaire fondamental, 70 mots essentiels (PDF), Retz - Citing V.A.C Henmon (dead link, no Internet Archive copy, 10 August 2023)
  14. ^ Liste des "70 mots essentiels" recensés par V.A.C. Henmon
  15. ^ "Generalities".
  16. ^ "PDF 3000 French words".
  17. ^ "Maitrise de la langue à l'école: Vocabulaire". Ministère de l'éducation nationale.
  18. ^ Baudot, J. (1992), Fréquences d'utilisation des mots en français écrit contemporain, Presses de L'Université, ISBN 978-2-7606-1563-2
  19. ^ "Lexique".
  20. ^ "Spanish word frequency lists". Vocabularywiki.pbworks.com.
  21. ^ Wiktionary:Frequency lists, 21 July 2024
  22. ^ Most frequently used words in different languages, ezglot

References

Theoretical concepts

Written texts-based databases

SUBTLEX movement
