The Sino-Tibetan Language Family

Description of the Sino-Tibetan Language Family


Sino-Tibetan (ST) is one of the largest language families in the world, with more first-language speakers than even Indo-European. The more than 1.1 billion speakers of Sinitic (the Chinese dialects) constitute the world's largest speech community. ST includes both the Sinitic and the Tibeto-Burman languages. Most scholars in China today take an even broader view of ST (called Hàn-Zàng in Mandarin), including not only these two branches, but Tai (= "Daic") and Hmong-Mien (= Miao-Yao) as well. Even taking ST in its narrower sense, we are dealing with a highly differentiated language family of formidable scope, complexity, and time-depth. Tibeto-Burman (TB) comprises hundreds of languages besides Tibetan and Burmese, spread over a vast geographical area (China, India, the Himalayan region, peninsular SE Asia).

Homeland and time-depth of Sino-Tibetan

The Proto-Sino-Tibetan (PST) homeland seems to have been somewhere on the Himalayan plateau, where the great rivers of East and Southeast Asia (including the Yellow, Yangtze, Mekong, Brahmaputra, Salween, and Irrawaddy) have their source. The time of hypothetical ST unity, when the Proto-Han (= Proto-Chinese) and Proto-Tibeto-Burman (PTB) peoples formed a relatively undifferentiated linguistic community, must have been at least as remote as the Proto-Indo-European period, perhaps around 4000 B.C.

The TB peoples slowly fanned outward along these river valleys, but only in the middle of the first millennium A.D. did they penetrate into peninsular Southeast Asia, where speakers of Austronesian (= Malayo-Polynesian) and Mon-Khmer (Austroasiatic) languages had already established themselves by prehistoric times. The Tai peoples began filtering down from the north at about the same time as the TB peoples. The most recent arrivals to the area south of China have been the Hmong-Mien (Miao-Yao), most of whom still live in China itself.

About Sino-Tibetan linguistics

The field of ST linguistics is only about 50 years old, and has been a flourishing object of inquiry for only the past 25. Scholars have been trying since the mid-19th century to situate Chinese in a wider genetic context. As the relationships between Chinese and Tibetan on the one hand, and Tibetan and Burmese on the other became obvious, vague notions of an "Indo-Chinese" family (Hodgson 1853, Conrady 1896) began to crystallize. The term Sino-Tibetan seems to have been used first by R. Shafer (1939-41, 1966/67), who conceived of it as a tripartite linguistic stock comprising Chinese, Tibeto-Burman (TB), and Tai (= "Daic"). Much of the area in which TB languages are spoken is still virtually inaccessible for linguistic fieldwork, at least by foreign scholars (NE India, Burma, Yunnan, Sichuan, Tibet, Laos, Vietnam). Only in Thailand and Nepal has vigorous international fieldwork been carried on since the 1960's.

The components of Sino-Tibetan

The Chinese Component

By any criterion (number of speakers, antiquity of documented written history, cultural significance, influence on other languages) Chinese ranks as one of the most important languages in the world. Yet the non-alphabetic nature of the Chinese writing system has posed unique problems for the historical linguist trying to reconstruct the phonology of earlier stages of the language, or establish a genetic connection between Chinese and other languages.

The great Swedish Sinologist, Bernhard Karlgren, basing his work on the pioneering philological research of 18th and 19th c. Chinese scholars, devoted some 35 years to the phonological reconstruction of the pronunciation of thousands of Chinese characters (Karlgren 1923, 1954, 1957). Karlgren recognized two earlier stages of the language: (1) "Ancient Chinese" (now usually called Middle Chinese "MC"), spoken during the second half of the 1st millennium A.D., and (2) "Archaic Chinese" (now usually called Old Chinese "OC"), spoken during the early Zhou (= Chou) dynasty at the beginning of the first millennium B.C.

The reconstruction of MC is based mainly on the "rhyme-books" produced by contemporary Chinese literati, especially the Qie Yun (602 A.D.), wherein each character was given a phonetic value by glossing it with 2 others, the first of which had the same initial consonant as the target character, while the second had the same "rhyme" (i.e. vowel, final consonant if any, and tone) as the target character.
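The glossing scheme just described (traditionally known as fanqie, a term not used in the text above) amounts to a simple composition rule, which can be sketched as follows. This is only an illustration: the `Syllable` class, the `spell` function, the tone labels, and the toy romanizations are assumptions made for the example, not actual MC reconstructions or rhyme-book entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    initial: str  # initial consonant ("" if the syllable has none)
    rhyme: str    # vowel plus final consonant, if any
    tone: str     # tone category label (hypothetical labels here)


def spell(first: Syllable, second: Syllable) -> Syllable:
    """Combine two glossing characters into the reading of the target.

    The first gloss contributes only its initial consonant; the second
    contributes its "rhyme" in the broad sense (vowel, final, and tone).
    """
    return Syllable(initial=first.initial, rhyme=second.rhyme, tone=second.tone)


# Toy syllables, invented purely for illustration.
first_gloss = Syllable(initial="t", rhyme="ok", tone="D")
second_gloss = Syllable(initial="h", rhyme="ung", tone="A")

target = spell(first_gloss, second_gloss)
print(target.initial + target.rhyme, target.tone)  # prints: tung A
```

Note that the rule only relates characters to one another. It yields actual sound values only once the readings of the glossing characters are themselves known, which is part of why reconstruction from the rhyme-books is indirect.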

The tools available for the reconstruction of OC are much more indirect and tricky to use: the patterns of rhyming in the earliest Zhou texts, especially the Book of Odes (Shi Jing); and the graphological structure of the characters themselves, the vast majority of which are constructed of two elements, a radical which gives a clue to its meaning, and a phonetic which gives a clue to its pronunciation. (But no more than a clue: it cannot be assumed that all the characters in a given "phonetic series" had exactly the same initial and rhyme.)

Despite the brilliant successes of Karlgren's methods, they have certain severe inherent limitations. First of all, the phonological system implied by the Qie Yun is forbiddingly complex, lending credence to the view that it does not represent the speech of any single dialect of the time (not even that of the Tang capital, Chang-an), but is rather pan-dialectal, noting distinctions made in any dialect with which the compilers (who came from various regions, as stated explicitly in the Preface) happened to be familiar. Secondly, there is no reason to suppose that the MC phonological system of the Qie Yun was the exact lineal descendant of the OC system deduced from the Shi Jing rhymes and the graphic structure of the characters (in the sense that, e.g. Modern High German "descends from" Middle High and Old High German). Thirdly, certain modern groups of dialects, especially the Min dialects of Fujian (= Fukien) and adjacent regions in SE China, have undergone distinctive phonological developments that are impossible to trace back to the presumed MC system of the rhyme-books (see Chang and Chang 1972; Norman 1974, 1988).

Despite the ingenuity of Karlgren's successors in patching up this or that aspect of his reconstructions -- or perhaps because of this very ingenuity -- Chinese historical phonology has until recently been in danger of degenerating into a kind of scholasticism: endless reinterpretations of the same data. For no matter how rich the material on earlier stages of a single language may be, one can only go so far by the methods of "internal reconstruction". A tripod cannot stand on a single leg. For further progress in ST comparative/historical linguistics, it is necessary to look well beyond Chinese.

The Tibeto-Burman Component

The key component of ST, the branch with the most numerous and highly differentiated individual languages, is TB. The existence of the TB family was posited as early as the 1850's, when it was noticed that many words in "Written Tibetan" (WT), attested since the 7th c. A.D., appeared cognate to forms in "Written Burmese" (WB), attested since the early 12th c. A.D. British scholars and colonial administrators in India and Burma began to study some of the dozens of little-known "tribal" languages of the region that seemed to be genetically related to the two great literary languages, Tibetan and Burmese. This early work was collected in the monumental Linguistic Survey of India [Grierson and Konow 1903-28], three volumes of which (Vol. III, Parts 1,2,3) are devoted to wordlists and brief texts from TB languages.

Subsequent sporadic attempts to find cognates between Tibetan and Chinese [e.g. Simon 1929] did not get very far, in the absence of any serious scheme for the reconstruction of Proto-Tibeto-Burman (PTB). It remained for the eccentric American amateur comparativist Robert Shafer to embark on a systematic project to assemble all the material then available on TB languages, and to venture a division of the family into subgroups. Much of this earlier data had been collected by colonial administrators or missionaries who had spent years living among the people whose languages they studied, and a number of the grammars and dictionaries they produced are of lasting value (Fraser; Hanson; Lorrain 1907; Lorrain 1940; Lorrain 1951; Mainwaring; Pettigrew; etc.). Shafer was assisted in this Depression-era WPA project by a talented junior collaborator, Paul K. Benedict, along with a motley team of half-trained indigents who spent their time combing through dictionaries and wordlists. The results were enshrined in a multi-volume unpublished manuscript (1939-41) called Sino-Tibetan Linguistics (see Benedict 1975).

Shafer went on to publish his great work, Introduction to Sino-Tibetan (1966/67), where he included Tai in the ST family, and offered a complex and detailed subgrouping of TB into "divisions," "sections," "branches," and "units". Despite the illusory nature of this precision, given the inadequate quality of the data then available for the various subgroups of TB, Shafer's classificatory schema has been adopted unquestioningly by many non-specialists.

Benedict, basing his work on the same data-base as Shafer, arrived at different conclusions. In an unpublished manuscript entitled Sino-Tibetan: a Conspectus (ca. 1941), he first of all banished the Tai languages from ST, leaving only Chinese on the one hand, and TB on the other. As for the internal subgrouping of TB, though Benedict generally followed Shafer in setting up 8 major TB "nuclei", he refrained from trying to relate these by family-trees or Stammbäume of the traditional type, preferring to stress that many TB languages had so far resisted precise classification, and that the subgroups which could safely be established seemed to interrelate in ways too complex for a simple tree-diagram.

When a revised and heavily annotated version of the Conspectus (henceforth STC) was finally published in 1972 with J. Matisoff as contributing editor, this agnostic view of the internal structure of TB was retained, with most of the family pictured as "radiating out" of the geographically central Kachin (= Jinghpaw = Jingpho) language of N. Burma, and the Karen languages singled out as being furthest away from this central nucleus.

STC offers close to 500 TB etymologies, as well as over 300 suggested cognates between PTB and Old Chinese. In spite of its shortcomings, its publication ushered in the modern era of ST studies, and has become recognized as the point of departure for future work in the field.

Tibeto-Burman languages and their subgrouping

Though the total number of TB speakers is only about 56 million, smaller than for Tai-Kadai or Mon-Khmer/Austroasiatic, the number of individual TB languages is the largest of any family in E/SE Asia. The relatively low overall total for TB is due to the fact that its most populous language, Burmese, has only about 22 million speakers, while the number of Thai (45.5 million) and Vietnamese (55.4 million) speakers has increased rapidly in recent decades.

Language names

Of the more than 1400 TB language names I have collected (Matisoff 1986), many are only multiple designations for the same language or dialect. Any given language is likely to be known by several different names: its autonym (what the people call themselves), and perhaps several exonyms (what other groups call them). Some of the latter may be loconyms (e.g. the name of a conspicuous village where the language is spoken, or of a nearby river). Thus, the 20,000 speakers of a certain language of Nagaland call themselves and their language Memi (and used to call themselves Imemai), but they and their language are now known to outsiders either as Mao, or as Sopvoma (the name of their principal village). The 40,000 speakers of Lotha Naga are called Chizima, Choimi, and Miklai by the neighboring Angami, Sema, and Assamese, respectively. Conversely, quite different languages are often called by the same or very similar names: Nung is both a Central Tai language and a Tibeto-Burman language of the Nungish group; Mru is a TB language of the Kuki-Chin group, but Maru, also TB, belongs to the Burmish group; Kham(s) is both a dialect of Tibetan and a separate language of central Nepal.
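The many-to-many relation between names and languages described above can be made concrete with a small lookup table. The sketch below is a hypothetical illustration, not the author's database: the record layout, the disambiguating labels such as "Nung (Central Tai)", and the "unspecified" kind are assumptions, while the names themselves come from the examples in the text.

```python
from collections import defaultdict

# (language label, name, kind of name). Labels are hypothetical disambiguators.
RECORDS = [
    ("Memi", "Memi", "autonym"),
    ("Memi", "Imemai", "former autonym"),
    ("Memi", "Mao", "exonym"),
    ("Memi", "Sopvoma", "loconym"),
    ("Lotha Naga", "Lotha Naga", "unspecified"),
    ("Lotha Naga", "Chizima", "exonym (Angami)"),
    ("Lotha Naga", "Choimi", "exonym (Sema)"),
    ("Lotha Naga", "Miklai", "exonym (Assamese)"),
    ("Nung (Central Tai)", "Nung", "unspecified"),
    ("Nung (TB, Nungish)", "Nung", "unspecified"),
]


def build_name_index(records):
    """Map each name to the set of languages it can designate."""
    index = defaultdict(set)
    for language, name, _kind in records:
        index[name].add(language)
    return index


index = build_name_index(RECORDS)

# Names that designate more than one language.
ambiguous = sorted(name for name, langs in index.items() if len(langs) > 1)
print(ambiguous)  # prints: ['Nung']

# Number of distinct names recorded for each language.
names_per_language = defaultdict(set)
for language, name, _kind in RECORDS:
    names_per_language[language].add(name)
print(len(names_per_language["Memi"]))  # prints: 4
```

Counting distinct languages rather than distinct names is what separates a raw list of 1400 designations from an estimate of how many languages the family actually contains.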

Old names (paleonyms) now felt to be pejorative are rapidly being replaced by new ones (neonyms). We are now, e.g., expected to say Yi, Mizo, Adi, Nishi, Karbi (instead of Lolo, Lushai, Abor, Dafla, and Mikir, respectively), even though these older terms have been enshrined in the Western literature on TB languages for a century. Nomenclatural proliferation continues apace, perhaps at a faster rate than ever before. It has recently been proposed to differentiate among approximately 25 Yi (Loloish) dialects of China by using the pronunciations of the vowels in their common autonym, e.g. Nasu, Nosu, Nusu, Neuseu, Nesu, Naso, etc. (Chen Kang, p.c. 1988).

A further complication is the fact that many language names are used in both a narrower and a broader sense, sometimes referring to one specific language, but often to a whole group of linguistically or culturally related languages. Very small or vulnerable groups will often call themselves by the name of a somewhat larger or more prestigious neighbor, hesitating to reveal their true ethnicity to outsiders. The tiny Anal people, an "Old Kukish" group of 6600 speakers in Burma and Bangladesh, "declared themselves to be Nagas in 1963" (Marrison 1967, p. 379). There is even a trend in Nagaland to artificially create larger linguistic/ethnic groups by combining syllables of several individual names, e.g. Chakhesang (from Chokri, Khezha, and Sangtam) and Zeliang (from Zemi and Liangmai).

With all this in mind, my best estimate is that the TB family contains at least 250 separate languages, which may be broken down into population categories as shown below. For about half of the languages in category (8), accurate population figures are not available, and some of them may be in danger of extinction.

Distribution of TB languages by number of speakers

Number of Speakers        Number of Languages
(1) over 1,000,000            9
(2) 500,000-999,000          12
(3) 250,000-499,000          11
(4) 100,000-249,000          16
(5) 50,000-99,000            16
(6) 25,000-49,000            27
(7) 10,000-24,000            44
(8) less than 10,000        123
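
As a quick arithmetic check, the category counts in the table can be tallied against the estimate of "at least 250" languages. The snippet below simply transcribes the table; the variable names are illustrative.

```python
# (population band, number of languages), transcribed from the table above
BANDS = [
    ("over 1,000,000", 9),
    ("500,000-999,000", 12),
    ("250,000-499,000", 11),
    ("100,000-249,000", 16),
    ("50,000-99,000", 16),
    ("25,000-49,000", 27),
    ("10,000-24,000", 44),
    ("less than 10,000", 123),
]

total = sum(count for _band, count in BANDS)
smallest_band = BANDS[-1][1]

print(total)                                   # prints: 258
print(round(100 * smallest_band / total, 1))   # prints: 47.7
```

The total of 258 is consistent with the stated estimate, and nearly half of the languages counted fall in the smallest category.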

Subgrouping of Tibeto-Burman

The most extensive account of the problems involved in attempting to subgroup TB in the light of our present knowledge is Matisoff 1978. As a working hypothesis, I have modified the unwieldy model presented in Benedict 1972 in several respects. The new TB family tree that I propose as a heuristic model is shown above.


Benedict's Kuki-Chin-Naga, Abor-Miri-Dafla, and Bodo-Garo subgroups, spoken in NE India and adjacent regions of Burma, are lumped together under the purely geographical rubric of Kamarupan (from Kamarupa, the Sanskrit term for Assam). These languages constitute the center of diversification of the whole TB family. Nagaland alone, with an area of only 6350 sq. mi., is home to some 90 Tibeto-Burman languages and dialects. With a few exceptions, e.g. Lushai (Lorrain 1940), Tangkhul Naga (Pettigrew 1918, Bhat 1969), Garo (Burling 1961), Tiddim Chin (Henderson 1965), Bawm (Schwerli 1979), these "Indospheric" TB languages have been poorly recorded until recently, and many are still hardly known at all.

Recent research is revealing how interesting, diversified, and important these languages are. An invaluable compendium of older data on the Naga languages is Marrison 1967, a source which was extensively utilized in the comparative study of the Northern Naga subgroup by W. French (1983). A. Weidert (1987) is a sophisticated and data-packed study of the phonology of Kamarupan languages, marred only by its disorganized presentation and over-formalistic approach. New raw material on Kamarupan languages is becoming increasingly available in the publications of the Linguistic Circle of Nagaland (Kohima) and the Central Institute of Indian Languages (Mysore), and through the efforts of energetic scholar-administrators like K. Das Gupta in Arunachal Pradesh [see Das Gupta 1963, 1968, 1971]. Yet a great deal of work remains to be done in this area of TB, and it would be unrealistic to attempt a precise subgrouping of Kamarupan at the moment, i.e. a clarification of the higher-order relationships of the subgroups traditionally designated as Kuki-Chin-Naga, Bodo-Garo (= Barish), and Abor-Miri-Dafla (= Mirish). Several important languages seem to fall outside any of these groups, e.g. Mikir (Grüssner 1978, Walker 1925), Meithei (Thoudam 1980), and Mru (Löffler 1966). Of all these languages, the Mirish ones seem most lexically aberrant from the viewpoint of TB in general, even in their numerals (JAM 1995).


The most exciting recent development in TB studies is the discovery of a new branch of the family, hitherto virtually unknown to Western scholars. These are the Qiangic languages of Sichuan. Extensive lexical and grammatical material has been collected on a dozen languages of the Qiangic group (Lu Shaozun 1980; Sun 1981, 1985, 1990). Besides Qiang, other languages in the group include Baima, Ergong, Ersu/Tosu, Gyarong (= rGyarong), Guiqiong, Muya, Namuyi, Pumi, Shixing, Zhaba. Ersu/Tosu is perhaps an indirect descendant of the extinct Xixia (= Hsi-hsia = Tangut) language, spoken in a once-powerful empire in the Tibetan-Chinese-Uighur border regions, finally destroyed by the Mongols in the 13th c. A large literature in Xixia survives, in a logographic writing system invented in the 11th c., with thousands of intricate characters inspired by, but graphically independent of, Chinese, the decipherment of which is now well-advanced by Japanese and Russian scholars (Nishida 1964/66, Sofronov 1968). It was thought at first that Xixia was a Loloish language, but it now seems more likely that it belongs to the Qiangic group.

From the limited data so far made available, the Qiangic languages are of unusual interest, both synchronically and diachronically. They are characterized by initial consonant clusters comparable in complexity to those of Written Tibetan. Many of these are clearly secondary, resulting from the reduction of disyllabic compounds (see 7.3 below). Some languages of the group are tonal, while others are not, providing an ideal terrain for the investigation of the mechanisms of tonogenesis.


Himalayish comprises such relatively well-known languages as Tibetan, Lepcha (Sikkim; see Mainwaring), and Newari (spoken in the Kathmandu valley of Nepal; see Malla 1984, Genetti 1990), as well as dozens of others, some on the verge of extinction. Progress has been particularly impressive in the study of the TB languages of Nepal, especially those of the Tamang-Gurung-Thakali-Manang group (Mazaudon 1971, 1978; Glover); Kham-Magar (Watters 1975, 1985); Chepang (Caughley); Sunwar (Genetti); and the "Rai" or "Kiranti" languages of E. Nepal, which are generally characterized by complex inflectional morphology (Allen 1975, Michailovsky 1981, Winter 1985, van Driem 1987, 1991). The westernmost languages in the TB family, e.g. Pattani (= Manchati), belong to the Himalayish group, and are beginning to be studied by Indian scholars (Sharma 1982).

Himalayish languages generally preserve prefixes and initial clusters well, along with final -s, -r, and -l. Written Tibetan is consonantally the most archaic attested TB language, preserving e.g. initial clusters that had disappeared from Chinese a millennium before.


Kachinic, like Karenic, is relatively undifferentiated, consisting basically of a single language and its dialects. Kachin (= Jingpho), spoken in northernmost Burma and adjacent parts of China and India, is well known, thanks to Hanson's dictionary (1906/1954), its (unpublished) revision by Maran, and recent work by Chinese scholars (1981, 1983). The name "Kachin" is also used loosely for various Burmish groups of N. Burma (Atsi, Lashi, Maru). Since Kachinic shows phonological and lexical similarities with several other branches of TB (Kamarupan, Himalayish, Lolo-Burmese [Matisoff 1974]), it has been considered to be genetically central in the TB family, just as it is geographically central (STC, p.6; Burling 1971). The Nungish languages (Lo 1945, Sun 1982, LaPolla 1986) seem closest to Kachinic, though it is too early to tell whether they also have a special relationship to the Qiangic group.


Burmese, attested since the early 12th c. A.D., is one of the best-known TB languages. (Good modern grammars are Okell 1969, Wheatley 1982.) The languages of the N. Loloish subgroup (called "Yi" in China) are firmly within the "Sinosphere", and many of them have been well recorded by Chinese scholars (Fu 1950; Gao 1955, 1958; Ma 1951, 1958; Yuan 1953). The Central and Southern Loloish languages are spoken as far south as Thailand and Laos, where Western and Japanese scholars have had access to them since the 1960's (see Hope 1974; Lewis 1968; Srinuan 1976; Nishida 1966/67). More detailed comparative-historical work has been done on Loloish than on any other branch of TB (Bradley 1978; Burling 1967; Hansson 1989; Matisoff 1968, 1970, 1972a, 1973, 1974, 1978, 1979, 1991; Nishida 1966/67; Thurgood 1981; Wheatley 1973).

Loloish has strictly monosyllabic morphemes, few initial clusters or final consonants, often complex tone-systems, and a penchant for compounding as its chief morphological device. The Loloish language with the most speakers and greatest dialectal differentiation is Lolo (Yi) itself, with 5 million speakers in Sichuan, Yunnan, and Guangxi, and a syllabic writing system of considerable antiquity (Ma: Cuanwen Congke). The tribal TB language that has been studied in greatest detail is Lahu (Central Loloish) (Bradley "Lahu dialects"; Matisoff 1969a, 1969b, 1972b, 1973/82, 1976, 1988, 1989, 1991). The Naxi/Moso language is close to the Loloish nucleus, and is of special interest because of its complex, hieroglyphic-like writing system (see Okrand 1974; Rock 1963; Matisoff 1972 [TSR], "Jiburish").


In my view the Karenic group of the Thai-Burmese borderlands should be considered as just another subgroup of TB, and need not be singled out as having split off from the rest of the family at an especially early date. The argument for the special status of Karen is mostly syntactic. Alone of all TB languages (except for the heavily Sinicized Bai), Karen has its objects coming after its verbs. Now that we realize that syntactic change can easily occur (either for language-internal reasons or as the result of close contact with other languages), this is a less persuasive criterion for genetic classification. Karen has been under heavy influence from Mon and Thai (both SVO languages). A tendency for the rightward "hopping over the verb" of certain nominal arguments (especially locative NP's) has also been pointed out for N. Loloish languages under Chinese influence (Wheatley 1982), yet there is no reason at all not to consider them to belong to "TB proper."

The Karen languages are only beginning to receive the attention they deserve. The early comparative work of R. B. Jones (1961) requires fundamental revision in the light of Haudricourt's contributions (1946, 1975). The publication of research now in progress (e.g. E. J. A. Henderson's dictionary of Bwe Karen and D. Solnit's grammar of Eastern Kayah Karen) will dramatically improve our knowledge of this key branch of TB.