My First Year with Computational Linguistics
How Did a Rural Child Find His Way to Linguistics?
My journey diverges from the conventional path.
I’ve always had a peculiar affinity for language—a trait that sets me apart from many who remain indifferent to its intricacies. As family lore has it, even as a young child, I could masterfully mimic the formal Mandarin of television news anchors, despite growing up in an environment where standard Mandarin was hardly the norm. Throughout my formative years, I frequently found myself in roles requiring public speaking, whether as an announcer or host. It wasn’t until I ventured beyond Sichuan that I became acutely aware of the heavy nasal and guttural qualities in my pronunciation. Recently, during a lecture where our professor played recordings of Professor Zhao Yuanren imitating various Chinese dialects, I found myself reflecting on these personal anecdotes.
My experience with English follows a similar pattern. I hail from a humble farming family and clan where formal education was sparse. In such circles, learning English was viewed as a “white elephant”—an impractical luxury, often met with class-based resentment. This mindset mirrors that of contemporary voices advocating for the removal of English from college entrance examinations—likely sharing significant demographic overlap with my family background. They seem oblivious to the class disparities at play, and how proficiency in English—much like standard Mandarin—can simultaneously bridge and widen these social gaps.
My formal English education began in middle school at age twelve—a point when, according to second language acquisition theory, deliberate study becomes necessary for mastery. Unlike children from middle-class families in fourth-tier cities, I had no prior exposure to the “enrichment activities” typical of urban upbringings. I was certainly aware of such privileged peers—the “rising stars” who seemed to inhabit a different world entirely. While I was an exceptionally sensitive child who would weep into the telephone over a single wrong answer on an inconsequential quiz, I remarkably never felt envy toward these prodigious classmates.
Despite my background, I consistently ranked first in English among my peers at a highly competitive junior high school filled with students who were either “privileged or studious” or “both privileged and studious.” This pattern continued well into high school. Recently, while watching “Young Sheldon,” I found myself deeply relating to the protagonist: like him, I had humble, modestly educated parents, making my academic achievements seem almost miraculous—creating dramatic contrasts with everyone around me.
For a small-town score achiever (小镇做题家), English was the catalyst that transformed me from a slightly above-average rural student into someone who could compete with more advantaged peers, though often labeled as “promising but inconsistent.” Now, two years removed from that highly standardized environment, I can smile at the notion of “inconsistency”—I’ve come to realize I wasn’t born for standardized testing. English had artificially elevated my perceived abilities, placing me in environments that exceeded my actual comprehensive capabilities. This “artificial elevation” has fueled a complex mix of dissatisfaction, defiance, and a desire for vindication over the past two years.
My linguistic journey took an unexpected turn in 2017 when I began writing songs, inspired by the rise of Chinese hip-hop. Later, influenced by Lexie, I experimented with code-switching between Chinese and English, gradually exploring syntax, grammar, and morphology beyond textbook conventions. I vividly remember writing my first rap, struggling to structure verses, studying other rappers’ work for guidance. Now, verse segmentation comes naturally—four lines per section, seamlessly incorporating verses, bridges, choruses, refrains. Similarly, my English evolved from basic subject-verb-object sentences beginning with “I” to a more fluid, intuitive approach where thought groups become rhythmic tools, occasionally sparking clever wordplay.
During this period—a story I’ve never shared before—I tearfully expressed my desire to study abroad to my mother. Influenced by Lexie, I believed only overseas education could fully nurture my linguistic talents. Predictably, the answer was no. From my parents’ limited perspective, they’d seen cases where “so-and-so’s child spent millions to study in some obscure country, only to return earning a few thousand yuan monthly.” While understandable, I’ve come to realize that one can only aspire to become the best version of what falls within their realm of awareness—you can’t pursue paths you don’t know exist.
Since junior high, when I started reading English weekly newspapers, I developed a habit of randomly noting down translated vocabulary beyond our curriculum. In high school, working with the “21st Century” newspaper, I filled countless notebooks with advanced vocabulary, word associations, and usage distinctions. It was then that I discovered my particular aptitude for remembering Chinese-English parallel texts, developing an almost eidetic memory for juxtaposed linguistic structures—a skill further enhanced by my high school politics teacher’s methods, which taught us to remember concepts by their spatial position in textbooks.
Even now, I distinctly remember encountering “ooze” and “lava” in a junior high English weekly’s fill-in-the-blank exercise; learning “spaghetti” from the junior high textbook, though teachers suggested it was optional; confidently using “bronchitis” in my first college class interaction, having learned it from high school texts without understanding the suffix “-itis”; and encountering “pigeonhole” in the 21st Century newspaper, translated as “subjectively categorize…” —a term that resonated years later when studying the Pigeonhole Principle in discrete mathematics, like “a bullet fired at seventeen finding its mark.”
Given my linguistic aptitude, one might have expected me to pursue a career in broadcasting, translation, or diplomacy. But that chapter remained unwritten. Instead, I found myself in what one professor recently described as a place that “makes a mountain out of a molehill” and “barely qualifies as an academic institution.” Yet, after a period of struggle, I underwent what that professor called a “forced transformation.” Even I was surprised that, from such an inauspicious beginning, I managed to find my bearings. If I could tell my former self during the college entrance examination period—when my thinking was as constrained as bound feet—that “computer science isn’t that intimidating,” it would have shattered that self’s worldview.
Here, few recognize the common threads connecting linguistics, computer science, artificial intelligence, cognitive science, psychology, and philosophy. Few care. Yet oddly, I see these connections clearly. I caught the first wave of ChatGPT registrations in late 2022; an institution like this should, in theory, have been at the forefront of the AI revolution. “Should” being the operative word. Instead, I found myself repeatedly asking, “What’s the point of all this?” Reader, can you imagine an institution that lacks vertical research projects in linguistics, where engineering graduate admission quotas are determined solely by the number of horizontal industry-university collaboration projects? (This insight came from a recent conversation with an alumnus—are we recruiting graduate students or cheap labor?) I believe the surface-level “research for the nation” narrative masks an underlying extreme utilitarianism and superficiality.
I’ve always believed in giving both praise and criticism where due.
Thus, growing in the cracks, one inevitably takes on the shape of those crevices. I’m unsure whether to be grateful for these constraints—who knows what might have been elsewhere? But ultimately, one must grow toward the light, bear seeds, and let them scatter to new places. Anywhere but here.
My journey transformed from a pure sanctuary of English and language studies—now largely corrupted by environment and zeitgeist (in the AI era, rural children are “deemed unworthy” of pursuing pure humanities like literature or translation, unless exceptionally gifted—mark my words)—to linguistics with data science, to going all-in on 11408 (admittedly wasting considerable time), and finally to computational linguistics. Throughout, I’ve been seeking tranquility.
Nobody truly understands computational linguistics—anywhere. Even recognizing it as an intersection of linguistics and computer science is rare. I’ve conducted extensive research and written articles (see: rexera.github.io/posts/NLP_CL_DH/), gaining some understanding. But I completely comprehend others’ confusion. Historically, structurally, and evaluatively, it’s an umbrella term. As I’ll discuss later regarding its history, I currently believe any work that both “employs technology” and “genuinely addresses linguistic issues” can be termed computational linguistics, regardless of whether it’s grounded in humanities or engineering. Those who merely “employ technology” without “addressing linguistic issues” are doing Natural Language Processing—the only term my university’s faculty seems to grasp. Pragmatically speaking, computational linguistics can award either engineering or humanities degrees, though everyone knows which is more valuable.
Indeed, this should illuminate the field’s inherent uncertainty. Wavering, ethereal, elusive—much like the path that brought me here.
Indeed. While “most people” follow relatively established paths within disciplines with mature theoretical frameworks, practical methodologies, evaluation systems, and social recognition, I somehow found myself drawn to computational linguistics, particularly fascinated by exploring its possibilities at the frontier of LLM development—a topic I’ll revisit later.
This attraction perhaps stems from a particular mindset: The field has evolved from its first principles—”to process language computationally, one must understand both language and technology”—to today’s landscape where LLMs present unprecedented opportunities. The vast unknown that lies ahead recognizes no hierarchy of seniority. Perhaps I relish this sense of equality and co-creation, where meaningful dialogue with authorities becomes possible.
The drawback is clear: such uncertainty in direction inevitably brings both spiritual and practical confusion. However, the advantage lies in perceiving broad possibilities rather than being “constrained by established paths and evaluation systems, unable to innovate”—either because supposed innovations have already been exhausted, or because the discipline or path inherently prohibits them. Contemporary computational linguistics, far from such constraints, embodies the opposite.
Precisely because no one fully comprehends its nature, various schools of thought and theories from diverse disciplines can flourish.
I embrace this future.
Recently, I attended an academic salon on anthropology, held over drinks, where a renowned scholar from Renmin University illuminated the discipline’s allure. Discussing Claude Lévi-Strauss’s “Tristes Tropiques,” particularly the chapter “A Writing Lesson,” the scholar touched on linguistics. I had previously encountered Max Weinreich’s observation that “a language is a dialect with an army and a navy” while watching the Linguistics Iceberg on YouTube—a striking illustration of language’s power at the governmental level, though seemingly distant from personal development.
That evening, the scholar offered a fresh anthropological perspective: while written language isn’t “essential” for civilizational development, it becomes “necessary” in the process. Writing, from its origins in divinatory oracle bone inscriptions, has consistently manifested as an expression of “power”—those who master symbols wield mysterious authority: the art of communing with heaven and interpreting all things. For a language-learning rural child caught in the fervor of social mobility, this insight was revelatory.
This reminded me of Elena Ferrante’s recent lecture collection “In the Margins,” where she opens one talk with a poem: “Historically, witchcraft was punished by hanging, but history and I, we find the necessary witchcraft in our daily surroundings.” While she discusses our use of linguistic and literary devices as “witchcraft” for self-expression, this resonates with our discussion of language’s relationship with mysterious power.
In modern society, language and terminology construct boundaries around professional domains, thereby structuring social power. This applies across all fields: literature, science, engineering, philosophy, artificial intelligence, linguistics… Within social dynamics, varying degrees of linguistic sensitivity and mastery create an inevitable, immutable, predestined inequality.
Once again, I’m humbled by my destined connection with language. After such a winding journey without losing this thread, it seems this path was truly meant for me.
First Steps on the Academic Path
Since being redirected to the English major, I’ve been obsessively contemplating various perspectives: how practitioners view themselves within the discipline, how the discipline views other fields, how other fields view English studies… the growth trajectory of foreign language disciplines, self-perception, and social recognition. Perhaps most people don’t engage in such reflection.
In our region, foreign language disciplines typically lead to three paths: literature, translation, or education. Unfortunately, I found each unsuitable for different reasons. Literature seemed beyond my reach—in today’s world, pursuing such “useless utility” feels impossible without substantial financial backing. Translation, I firmly believe, lacks a future. Education failed to resonate with me personally—teaching children brings me no sense of fulfillment.
Thus, I gravitated toward linguistics—a more theoretical, academic path. Only recently did I realize I was in an environment that heavily prioritizes employment over academic pursuit—a characteristic prevalent across all disciplines here. Logically, and given my family background, this path might seem unsuitable. Yet no other path captured my interest. Last semester’s Introduction to Linguistics was taught by a discourse analysis specialist, emphasizing practical applications and industry connections over theoretical foundations—perfectly consistent with our overall environment.
Earlier, during a period of uncertainty, I briefly explored traditional linguistics. It wasn’t formal study per se—just working through two chapters of a linguistics graduate entrance exam book and unconsciously absorbing various linguistics videos. Gradually, I learned to scientifically understand Chinese, English, and dialects; how to systematically approach language learning; the three aspects of linguistic study: phonology, morphology, and semantics; both microscopic language-centered linguistics and macroscopic applied interdisciplinary linguistics; major schools of thought and linguists in both Chinese and international contexts; primary theoretical frameworks; Chomsky’s contributions to both linguistics and computer science; Halliday’s functional grammar… Thanks to YouTube and Bilibili.
Yet I still sensed that traditional, quote-unquote linguistics was too nerdy. When ChatGPT exploded in late 2022, I was among the first wave of users, immediately recognizing its potential. This marked a crucial milestone. From this point, I experienced a significant crisis in my professional identity (as language skills—the pride of pure humanities students—became susceptible to large-scale automation), attempting, with extreme idealism, to completely pivot to CS (how naive).
My confidence in language skills was restored when I started working in an AI research group under a sub-advisor of a prominent professor last year. I discovered that these seemingly capable engineering students struggled not with technology, but with English—whether conducting research, reading papers, or understanding package and repository documentation. I gradually realized that technology is easily learned when tutorials are clear enough. But language learning has no one-size-fits-all tutorial. Put somewhat arrogantly, it’s easier for someone with strong language skills to acquire technical knowledge than vice versa. Wow, I’m startled by my own assertion here—it almost feels heretical. It’s certainly not universally applicable; take it with a grain of salt.
After several deep learning and LLM projects, particularly as I gained a more systematic understanding of LLMs’ evolution, principles, and applications—and especially after learning more about ACL (the Association for Computational Linguistics), the top conference in NLP (interestingly named after “computational linguistics”)—I began to discover that computer science that “cares about language” not only exists but holds high status. I gradually realized that pure CS wasn’t suitable for direct competition, nor as a lifelong pursuit, leading me to deeper linguistic contemplation.
Linguistics itself has two schools: armchair linguists and empirical linguists. I prefer the groundedness of the empirical approach.
In my search for advisors, I’ve essentially mapped out all the places and professors in China who explicitly work in computational linguistics. The options are remarkably limited.
When Chomsky brought context-free grammar to computer science, as computers and the internet began to proliferate, language processing emerged as a natural concern. The initial wave of NLP technologies naturally developed around English, making Computational Linguistics an obvious field in the West. However, in China, we faced the anxiety of having zero digital infrastructure for Chinese while English-centered systems were fully mature. Everything—encoding, input methods, display—had to be built from scratch without precedent, not to mention subsequent processing tasks like search.
This led to calculating Chinese entropy to determine encoding bit-length, collaboration between Chinese language and computer science scholars, the emergence of “Chinese Information Processing” as a discipline, and the development of what we now take for granted: pinyin input methods and predictive text.
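To make the “entropy determines bit-length” step concrete, here is a toy sketch. The four-character distribution below is invented purely for illustration; only the two formulas reflect the actual idea.

```python
# Toy sketch: the arithmetic behind "how many bits does a Chinese character need?".
# The distribution is invented for illustration; only the formulas are the point.
import math

# Shannon entropy H = -sum(p * log2(p)) gives the theoretical lower bound in bits/character.
toy_distribution = [0.5, 0.25, 0.125, 0.125]   # a made-up four-character "language"
H = -sum(p * math.log2(p) for p in toy_distribution)
print(H)   # 1.75 bits per character

# A fixed-length code over N distinct characters needs ceil(log2(N)) bits:
# e.g. the 6763 characters of GB2312 need 13 bits, hence two-byte encodings in practice.
print(math.ceil(math.log2(6763)))   # 13
```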
Once these fundamental issues were resolved, processing Chinese and English became essentially similar. Coinciding with the deep learning wave, solving language processing problems “seemingly” required less linguistic knowledge, even spawning absurd claims like “every time we fire a linguist, the speech recognition system improves.” Thus, “computational linguistics” disappeared, and “Chinese Information Processing” was forgotten.
“Natural Language Processing” took their place. Linguistics vanished from view, and terminologically, NLP became purely an engineering celebration. Today, among 100 AI researchers, 96 work in CV, 3 in NLP, and 1 in speech processing—imagine the survival state of computational linguistics.
So how did I excavate “computational linguistics” like an archaeological exploration? That’s a story for another section.
A Disciplinary Wanderer’s Path to Computational Linguistics
Thanks to my institution, I was once enchanted—even self-mystified—by the idea of “interdisciplinary collaboration” between Chinese literature and computer science scholars, the kind of “digital humanities” mentioned earlier. However, from my first deep learning project onward, this fascination gradually demystified itself. Looking back, today’s academic environment seems too impetuous, human motivations too complex, for my idealized vision of the past to materialize.
When did I completely lose faith in my idealized version of “interdisciplinary studies”?
Around this summer’s military training period, our university announced a breakthrough in “semantic information theory.” Having gained some international recognition in their traditional field of information and communication, they trumpeted it as “the first theoretical innovation in information theory since Shannon’s classical theory.”
Though I’m no expert in communications, it seemed too good to be true. With GPT’s assistance, I skimmed the paper. It felt increasingly familiar… The core idea was simply incorporating semantic dimensions into information theory using some deep learning models. Semantic processing with deep learning is already a thoroughly explored direction in deep learning-based NLP. Many information theory formulas, like cross-entropy, are common in NLP too. How did they integrate this with communications? I honestly don’t know. But I do know this exemplifies “stir-fry research”—establishing a new academic territory without substantial theoretical advancement or structural framework adjustment. It’s institutionally independent but intellectually derivative, merely applying one field’s tools to another’s problems.
While I admire the renaissance ideal, I deeply understand that being a “polymath” is unrealistic in our era—most people will, and should, find their specific niche. It’s about balancing breadth and focus. Focus is essential; equal effort across all areas is impractical. Interdisciplinary work can only succeed with clear priorities and institutional independence. The idealistic A + B = C seems nonexistent. We can’t even call it “XX-ology,” only “XX studies”—like interdisciplinary studies or legal language studies.
What frustrates me? Partly that I could probably do similar work, but more fundamentally, my inability to settle. I seem to perpetually wander between disciplines. Yet disciplines are inherently connected, and finding one’s place is challenging enough. Perhaps someone needs to stand at these intersections? Perhaps someone needs to wander? I’m not sure.
The figure shows my simple map of the intersecting relationships and key concepts among mathematics, information theory, linguistics, and communications, drawn up after studying the “semantic information theory” paper.
I’m strange.
Amidst the forest of disciplines, I seem to perceive what others miss—the interconnected pathways, the intricate networks, the glimmers along these routes. Yet I resist settling down. Perhaps it’s only through settling that one can take root, accumulate knowledge, and meet society’s evaluation criteria.
Zhuangzi’s Inner Chapters state: “Life is finite, while knowledge is infinite.” Using our limited lifespan to pursue unlimited knowledge can only lead to physical and mental exhaustion, contradicting the principles of nurturing life. Zhuangzi advocates for “ultimate knowledge,” suggesting that knowledge aligned with the Way is beneficial, while knowledge contrary to it should be minimized. The pursuit of knowledge involves both accumulation and discrimination—identifying what aligns with the Way and eliminating what doesn’t.
But what is the “Way”? What constitutes alignment or contradiction? Must everyone have a fixed point of landing? Must we confine ourselves within disciplinary labels?
I must settle down. I must even actively label myself within a specific discipline. Because a “certain discipline” label serves as a quick identifier for the outside world—however reluctant I am, I must wear this hat, as swift social recognition benefits oneself. Like bitter but effective medicine. Yet internally, I need to understand that disciplinary divisions are bullshit; what matters is doing what I believe is right.
Whether under a linguistics or computer science master’s program seems less crucial now. Someone needs to stand at the intersections of disciplines. You don’t need to become someone you can’t be.
At that time, I hadn’t yet boldly added “computational linguistics” to my profile. Looking back, having such thoughts seems to mark a beginning.
The second hurdle I needed to overcome lay here too—I once wanted to directly challenge the 11408 (English I, Mathematics I, Computer Science Fundamentals 408). Until a senior student enlightened me: “You don’t need to become someone you can’t be.”
Recently, I saw someone criticize Wang Dao’s (a Chinese computer science graduate exam prep institution) courses as “too exam-oriented” for applications, suggesting MIT’s online courses were more appropriate. I looked at MIT’s navigation page and couldn’t even understand it 😅… So I opened the long-bookmarked csdiy site. The CS world is vast, with each intermediate checkpoint worthy of extensive time investment. But my time seems somewhat limited. For a lifetime, it’s fine. For next year’s applications, it’s too little.
They say there’s a significant gap between domestic and international CS education. The csdiy site showed me that even China’s top 2 universities share this uniquely Chinese confusion, and the requirements are particularly high for undergraduates entering industry directly. No wonder foreign CS education (at least in terms of teaching) is so solid—their undergraduates graduate into very different conditions.
Some people specialize in CS, but we must allow for those who stand at disciplinary boundaries. Even if you seem mediocre in both fields, or ineffective in either, you are still yourself. For academic advancement and research in your niche, you don’t need to be an expert in every aspect of a field.
Top 2 universities are top 2 for a reason. The other 98% still need to survive. Know where you can go, but don’t reach beyond your grasp. Do what you believe is right, based on a strong heart and sharp vision. Don’t let observations about others completely disrupt your rhythm.
When I’m stuck in the puzzle of project reconstruction, I need to step back and remind myself: this is Microsoft’s project, this is Huawei’s project, this is Tsinghua’s project, this is Alibaba’s framework, this is Google’s paper. Getting stuck is normal; solving it is remarkable.
First, modern research has no “low-hanging fruit” (Yu Miao, “Guide to Modern Research”; 于淼《现代科研指北》), hence the shift toward problem-oriented, naturally interdisciplinary approaches.
Second, majors, disciplines, and fields are different. I’m choosing a major now, but fields can transcend majors.
Third, people need to settle to take root.
At this point, computational linguistics and I have finally come together.
From NLP to LLM: Finding My Place
During freshman orientation, I had a vague feeling I would choose the “Linguistics and Language Technologies” concentration. Ironically, this concentration has now been replaced. Ha! The new program ties everything to education. Indeed, everything gravitates toward what’s manageable and moldable. Truly, whoever controls language and “discourse power” wields mysterious authority.
Given my environment, struggles, and peculiar personality, I’ve attempted to distinguish between computational linguistics, natural language processing, and digital humanities, making some headway in differentiation. (See: rexera.github.io/posts/NLP_CL_DH/) However, at the cutting edge, these fields increasingly converge. Computer scientists are expanding their research possibilities using social science approaches, while linguists employ technical means to enhance language research efficiency.
When I told Professor X, a longtime NLP researcher, that I might lean more toward computational linguistics (focusing more on language itself rather than general tasks), they corrected me, explaining that CCF’s special committee is actually called the “Computational Linguistics Committee,” encompassing algorithms, models, evaluation methods, and more. Upon reflection, I realized we could understand it this way: natural language processing is a craft, while computational linguistics is a discipline. This began to demystify things.
Later, during interviews, I met Professor Y, who was particularly interested in my infusion of linguistic theory. I’ll omit some details here 😙. In essence, through Professor Y, I discovered that English language and literature aren’t as distant from NLP as they might seem. The frontier of computational linguistics and corpus linguistics lies in natural language processing, and NLP is no longer a field with insurmountably high barriers. How so?
When Professor X first “introduced me to the field,” NLP still seemed mysterious and unknown to me. They said, “Your generation doing NLP will basically all work with large language models.” The way they phrased it was particularly memorable, leaving a deep impression.
A year later, this has proven true. LLMs have effectively “killed” many traditional NLP tasks (information extraction, named entity recognition, sentiment analysis, article summarization) and the machine learning and deep learning models specialized for these tasks—at least from a “research significance” perspective. Since universities lack the hardware capabilities to train LLMs from scratch (all universities, without exception), the entire NLP field has shifted toward “creative LLM applications,” essentially forcing a massive paradigm shift on the NLP academic community. While understanding these traditional technologies and tasks as historical context and auxiliary tools remains necessary, research now basically requires just understanding LLMs.
Generalized “computational linguistics” and LLM-based “artificial intelligence” can integrate with many fields. Medical, biological, pharmaceutical, financial, architectural, legal, mechanical… scholars from various domains have ventured into so-called “AI empowerment.”
Traditional machine learning required “feature engineering”—apples are round, red, sweet; text, images, and audio needed separate processing—then machines would determine if new data matched certain patterns.
Modern machine learning is all about embedding/encoding/feature extraction, transforming raw inputs—text, images, audio—into vector/matrix representations that only machines can comprehend. Subsequent recognition, classification, and generation are purely numerical computations… differing only in model parameter size and computational speed. Hardware advancement has made massive parameter scales possible (time to invest in NVIDIA overnight). A linear equation has just two parameters, yet it can divide a coordinate plane into two regions. Rumor has it GPT-4 has 1.8 trillion (1800B) parameters, essentially able to “kill the game” in all classic NLP tasks.
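As a minimal, hypothetical sketch of that contrast (the “apple” features and the embedding table are invented; no real system is being described):

```python
import numpy as np

# Traditional route: a human decides which features matter ("apples are round, red, sweet"),
# then a tiny model with a handful of parameters draws a boundary in that feature space.
def handcrafted_features(fruit: dict) -> np.ndarray:
    return np.array([fruit["round"], fruit["red"], fruit["sweet"]], dtype=float)

w, b = np.array([0.8, 0.9, 0.5]), -1.0          # a linear boundary: two regions of feature space
def looks_like_apple(fruit: dict) -> bool:
    return float(handcrafted_features(fruit) @ w + b) > 0

print(looks_like_apple({"round": 1, "red": 1, "sweet": 1}))   # True

# Modern route: the representation itself is learned. Here a fake "embedding table" stands in
# for vectors a real model would learn from data; everything downstream is numerical computation.
rng = np.random.default_rng(0)
embedding_table = {"apple": rng.standard_normal(8), "pear": rng.standard_normal(8)}
print(embedding_table["apple"].shape)            # (8,): a vector only the machine can interpret
```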
Returning to computational linguistics, this means tasks requiring intense expertise, imagination, and creativity—like deciphering oracle bone inscriptions (Deciphering Oracle Bone Language with Diffusion Models, Guan et al., ACL 2024)—can now be delegated to machines. Machines might propose compelling solutions in their own way, with solution “processes” incomprehensible to humans, only reconstructable in reverse.
As parameter counts increase and LLMs become basic sensemaking machines, various symbolic and first-principle methods find their place. This is what I call computational linguistics’ linguistic turn. Similarly with “human-like intelligence”—once we view LLMs as “entities” with basic sensemaking abilities and occasional brilliance, research becomes fascinating.
During interviews, we discussed the advantages of multiple agents over single agents: Kahneman’s System 1 and System 2 theory from “Thinking, Fast and Slow” plays a role here. OpenAI’s o1 actually employs System 2 methods, forcibly increasing intermediate deliberation, preventing models from “blurting out” answers and instead “thinking step by step”—essentially neuroscience/psychology inspiring LLM development. Multiple agents work similarly: breaking down and redistributing tasks proves more effective than single-step thinking.
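A rough sketch of that “decompose, deliberate, then synthesize” pattern in its simplest single-process form (this is not OpenAI’s implementation; `call_llm` is a hypothetical placeholder for whatever chat API one actually uses):

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: plug in a real model call here."""
    raise NotImplementedError

def answer_directly(question: str) -> str:
    # "System 1": one shot, the model may "blurt out" an answer.
    return call_llm(f"Answer concisely: {question}")

def answer_with_deliberation(question: str) -> str:
    # "System 2": break the task down, solve each piece, then synthesize.
    plan = call_llm(f"Break this task into 3-5 numbered sub-steps:\n{question}")
    partial_results = [
        call_llm(f"Task: {question}\nSolve only this sub-step:\n{step}")
        for step in plan.splitlines() if step.strip()
    ]
    return call_llm("Combine these partial results into one final answer:\n"
                    + "\n".join(partial_results))
```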
Look how interdisciplinary fusion permeates “problem-oriented” research. Why shouldn’t linguistics contribute?
Well, a year ago, someone actually told me “linguistics is useless,” and I internalized that for quite a while. So next, I’d like to make an interesting callback to a moment from a year ago.
It was around November 2023 when I attended a lecture by an LLM team leader from Chinese tech giant B, discussing the evolution of LLMs. At that time, “LLM pre-training” was still in heated development, unlike now, when people have gradually realized that regardless of parameter increases or architectural optimizations, pre-training is unlikely to achieve improvements as impressive as ChatGPT’s late-2022 breakthrough. The speaker maintained an almost fundamentalist belief in the Scaling Law. I still had long hair then and had just begun exploring “data science,” genuinely knowing nothing, when I asked about linguistics’ role in LLMs. His response (stripped of diplomatic language) was that we’d been stuck in rule-based expert systems for far too long, and focusing on linguistics now would be a mistake—LLMs would continue scaling indefinitely, with endless performance improvements.
The most devastating line was: “Humans are prone to errors.”
That moment was catastrophically demoralizing for a linguistics student like me. I remained dejected for a long time, even concluding that “the devil had a point—current NLP does lean more toward computer science.” Looking back now, this view was both right and wrong. Models do follow scaling laws during “pre-training,” but the biggest current challenge is that we’ve exhausted all recorded human text. Yes, truly exhausted—I’m surprised too, but technological progress has been remarkably swift. Training models on human-AI hybrid “augmented data” or purely AI-generated data leads to model collapse beyond a certain point. This isn’t just my opinion; it was Nature’s cover story this July.
Thus, universities’ so-called “post-training” experiments with “creative LLM applications” have naturally emerged. Some propose concepts like Post-Training scaling law or inference scaling. This suggests that whether through prompt engineering, multi-agent clusters, or decomposing slow thinking, as long as deliberation continues to expand, the entire system’s (rather than just the model’s) performance will keep improving.
This logic is quite un-“internet-like.” Who could have predicted it? Such insights emerge only through research inquiry.
Today, I want to tell that speaker, “You too are prone to errors.”
Research has meaning. Absolutely.
Ironically, while my university provided me with a starting point and general environment for research, all my research-related understanding came from external experiences. It’s quite remarkable. Good institutions don’t “restrict,” “exploit,” or insist “you must follow my direction,” but rather say “you can pursue your own research interests.” Though I think I might be too full of ideas—often, the institution isn’t really to blame.
I saw a “group portrait” on Zhihu: Students who’ve heard secondhand that “research is important,” who’ve helped review a few papers, run some baseline experiments, yet still have no real concept of what research is—when asked, they’re clueless about everything. In the supply-demand relationship, two types are popular: those who thoroughly understand their advisor’s direction, are clear-headed, and have their own insights and ideas; and those who prefer having no ideas but have good grades, are obedient, and easily manageable.
Having “too many ideas” is actually quite suitable for research, though it might make life difficult in certain environments. Nevertheless, at this point, we must “make art in the cracks, create temples in snail shells.” (A Chinese saying, “夹缝里做文章,螺蛳壳里做道场”) That’s why I’ve always felt there’s still some distance between LLMs, AI, NLP, and computational linguistics. But recently, while conducting experiments, I suddenly realized: as long as I’m working with text data, I’m truly doing NLP.
Running four experiments simultaneously, I suddenly realized that all data is structured characters, and data processing is essentially processing characters—processing “language,” “natural language processing.” Regular expressions, which I once scorned, proved incredibly useful in my process of “fitting rigid frameworks to living LLMs.” Perhaps this is where NLP’s value lies.
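For instance, something like the tiny sketch below (the output format is hypothetical) is what “fitting rigid frameworks to living LLMs” usually looks like in practice:

```python
import re

# A hypothetical, free-form model reply that we need to force into structured data.
llm_output = """Reasoning: the premise entails the hypothesis.
Label: ENTAILMENT
Confidence: 0.87"""

label = re.search(r"Label:\s*(\w+)", llm_output)
confidence = re.search(r"Confidence:\s*([01](?:\.\d+)?)", llm_output)

print(label.group(1) if label else None)                     # ENTAILMENT
print(float(confidence.group(1)) if confidence else None)    # 0.87
```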
From NLP to LLMs to computational linguistics, I’ve painfully distinguished and deeply explored concepts, witnessed their evolution over time, and skillfully found my position among them. Painfully distinguishing concepts, yet not being bound by them. However…
Does Emphasizing “XX Linguistics” Still Matter?
Absolutely.
My question centers on this: for instance, in “corpus linguistics,” the cutting edge actually lies in natural language processing. As we’ve seen, “computational linguistics” has very blurry boundaries with natural language processing. So why insist on emphasizing “linguistics”?
This was a question I didn’t expect to find an answer to. Yet surprisingly, thanks to my institution, I got the opportunity for clarity.
We hosted a “Corpus Beijing” event that gathered professors from several universities. Curious about what corpus linguistics actually does, I attended using my convenient “undergraduate student” identity—a status that lets you bypass hierarchical concerns and respectfully address everyone as “professor.” (“老师” in Chinese)
Professor W’s keynote speech illuminated the disciplinary origins of corpus linguistics: At Shanghai Jiao Tong University in the 1980s, computer science and linguistics showed no signs of methodological integration. That pioneering spirit of “doing what’s never been done” or “thinking what’s never been thought” was truly moving. Reflecting deeply, my current challenges pale in comparison to those pioneers’ groundbreaking work. In an era where even basic computational resources were scarce, having such vision and courage was remarkable.
Corpus linguistics discusses scientific methods for studying language: seeking patterns in real, living, large-scale corpora, using technology and concepts to create new knowledge through empirical verification—collocation, colligation, mutual expectancy… This defines corpus linguistics’ scope from ontological, epistemological, and methodological perspectives.
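As a toy illustration of one such measure, pointwise mutual information (PMI) as a gauge of collocation strength, computed on an invented mini-corpus (nothing here comes from the talk itself):

```python
import math
from collections import Counter

corpus = "strong tea strong coffee powerful computer strong tea weak tea".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def pmi(w1: str, w2: str) -> float:
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ); higher means a stronger collocation.
    p_xy = bigrams[(w1, w2)] / (N - 1)
    return math.log2(p_xy / ((unigrams[w1] / N) * (unigrams[w2] / N))) if p_xy else float("-inf")

print(round(pmi("strong", "tea"), 2))       # about 1.30 on this toy corpus
```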
Before 1985, corpora were linguists’ domain; computer scientists were still using Chomsky’s rule-based methods—inevitably reaching dead ends. NLP scholars joined existing corpus linguistics, and computational linguistics incorporated corpus methods—the seemingly contradictory aspects (technological and linguistic perspectives) have always been interdependent.
Forty years later, NLP scholars present new challenges to corpus linguists. Professor W noted that in this critical period of both challenges and opportunities, corpus linguistics needs to embrace this paradigm shift. The professor expressed eager expectations for newcomers. Yet I still struggle to clearly distinguish between corpus linguistics and computational linguistics.
We presented our so-called “corpus linguistics” achievements—a patchwork of foreign languages, Chinese, and artificial intelligence. Professors from other institutions responded diplomatically: “We hadn’t realized corpus linguistics had such rich implications here.” Ha! Was I the only one who caught the undertone? I’m not sure. Our “humanities” on Nanfeng Road has never had independent character, always parasitic on the school’s specialized features. As one professor put it, “relying on ‘fraternity departments.’” Sure.
Then, surprisingly, we discussed internal work independent of these “fraternity departments.” Professor O, whom I greatly admire, opened with: “Oh? You have a foreign language department? You do research?” Overall, internal development is indeed rather impoverished, with too many scattered directions failing to create synergy. Moreover, the university’s demands on scholars have become increasingly impetuous: requiring assessments and evaluations, completely disregarding how social sciences and natural sciences require fundamentally different evaluation criteria.
Looking ahead, we’ll do more second language acquisition, academic discourse, and English-Chinese comparison; more vertical research projects (currently far too application-oriented, too many horizontal projects); recruit more high-level scholars. Establish “organized research.” View problems through future prospects.
Professor O poignantly noted that interdisciplinary collaboration should be equal and voluntary (what experiences led to this, I wonder, haha), as engineering professors often expect linguistics professors/scholars to serve as their assistants (literally “be their hands”), finding applications for their technology, “doing NLP just for the sake of doing NLP.” Oh trust me, I understand this all too well.
Again, language research here lacks independent character. Professor O concluded by emphasizing that “leaders need a certain breadth of vision.” Indeed.
Finally reaching Q&A, I struggled with my usual physiological nervousness during such sessions. After hearing several questions and finding my composure through several mental adjustments, I posed a substantial question:
I introduced myself as an English major studying computational linguistics and corpus linguistics. I expressed curiosity about the frontiers of corpus linguistics and computational linguistics, noting how their boundaries are increasingly blurring. While we’d discussed technology’s contributions to language research and education, I wanted to offer a reverse perspective: linguistics/social sciences’ inspiration for computational linguistics. I mentioned ACL’s work—role-play, various benchmarks inspired by linguistics and social sciences.
Does positioning matter? Is disciplinary independence important? Must we emphasize this “XX linguistics”?
To be honest, my question left several professors bewildered. But Professor W, true to his landmark status, provided an incisive answer. Here’s my interpretation of his response (with some adaptation):
Why have distinct disciplines? Different research objectives and historical trajectories necessitate separation. Historically, corpus linguists’ value has always centered on serving language education and research. What does this disciplinary division bring? Both mutual support and group biases (in evaluation standards and group expectations). Newcomers to research must meet these disciplinary expectations to enter the scientific community, whether satisfying these expectations serves to maintain or break them.
Linguists are primarily interested in language itself. They’re not concerned with improving machine performance metrics. Modern research’s “problem-oriented” approach has eliminated “low-hanging fruit,” inevitably leading to interdisciplinary work. Yet disciplines’ core nature—their “history” and “classics”—constrains interdisciplinary efforts. In this “post-classical” era, while authority faces challenges, we still need steadfast dedication—”an artist’s persistence.” Indeed, academic disciplines have reached a historical cycle of metabolism.
Researchers are driven by curiosity. Our curiosity reflects our self-exploration and questioning. But what exactly are we questioning? Linguistics addresses language structure, semantics, and pragmatics. What truly interests you? As “corpus linguists,” as “linguists,” we fundamentally question and engage with language itself.
Enlightening. No other word could describe this revelation. A true epiphany.
After Professor W broadened everyone’s perspective, other professors offered fascinating insights. One professor remarked on rapid temporal changes, noting that teachers themselves need self-development. Humanities face particular challenges, even identity crises, making mutual inspiration valuable—calling for deep collaboration between humanities and engineering.
Another professor pointed out the “re-discovery” pattern: different disciplines often discover the same thing twice, giving it different names. Indeed.
One professor criticized how many humanities scholars claim to “embrace technology” while few actually engage with it. While current research is problem-oriented, the problem-solving tools themselves can constitute a discipline: for corpus research, statistical “instruments” like logistic regression and variance analysis are absolutely updatable and worthy of study.
Another professor highlighted different definitions of “innovation” between humanities and engineering. Engineers see automation and de-humanization as innovation, while humanists emphasize human capability—suggesting these engineering professors lack broader vision. Correct.
Synthesizing these perspectives, I realize I’m genuinely interested in language itself. While my methodology might lean more technical than theoretical, my interest centers on language. Finding this positioning is crucial.
Thus, my future path became clearer: to be a computational linguist. Caring about technology but caring more about language. Using technology without fetishizing it.
Therefore, emphasizing the “linguistics” in “computational linguistics” is absolutely essential for me. I must distinguish myself from NLPers who “only focus on performance metrics,” both in self-perception and external recognition. It’s truly necessary.
LLMs and Linguistics
In contemporary computational linguistics, LLMs are an unavoidable topic.
Currently, those discussing LLMs and linguistics are typically domestic armchair experts who, having benefited from the previous era’s opportunities, left the frontlines without ever writing papers or conducting experiments themselves. They treat journal submissions like social media posts—how can they claim genuine insight into technology and engineering? While eloquent about LLMs’ impact on linguistics, they’re speechless about linguistics’ contributions to LLMs, often making fundamental conceptual errors.
LLMs’ massive parameters, general language capabilities, and reasoning abilities have created opportunities for first-principle methods represented by linguistics and symbolic AI, allowing computational linguistics to lean toward linguistic theory. This also gives computational linguistics potential for broad application like functional grammar, extending to almost every aspect of every discipline involving human participation.
As for the discipline itself, computational linguistics has plenty of significant problems to solve. Alignment and evaluation alone are substantial enough topics for three to five years of graduate research. How can machines understand domain-specific natural language like humans? How can they generate natural, high-quality sentences like humans? What kind of language machines do we actually want? Those that explain complex concepts simply? Those that dissect with surgical precision? Those with humanistic care?
Chomsky wrote a New York Times article in 2023 titled “The False Promise of ChatGPT”—essentially arguing that human thought and language use represent “elegant operation based on minimal information,” while LLMs’ language generation, marinated in massive corpora, is merely crude imitation, not fundamentally “understanding” language—suggesting LLMs offer no theoretical innovation.
But that’s precisely the point. LLMs are an NLP engineering achievement, never meant to solve linguistic theoretical problems. It’s like criticizing electric vehicles for being less safe than cars, or cars for being less agile than electric vehicles—clearly two products serving different scenarios and preferences.
Then this year’s Nobel Prize in Physics was awarded to John J. Hopfield and Geoffrey E. Hinton for artificial neural networks. Hinton’s acceptance speech repeatedly, both explicitly and implicitly, disparaged the symbolic school, declaring how previous language rules were useless or even misguided AI development.
But was Chomsky really wrong? His claim that language isn’t acquired but innate has some merit, though it’s problematic—what about my struggles learning languages? His assertion that LLMs don’t truly understand language is both problematic and not. Personally, I think this involves an ethical question, similar to standardized versus non-standardized units—if we broaden our threshold for “intelligence” and “language understanding,” not requiring machines to comprehend language exactly like humans, not demanding a Marxist materialist warrior’s stance that “artificial intelligence is a tool, human subjective initiative plays the leading role,” then at a “results” level, machines have indeed understood language.
Evidently, Chomsky’s definition of “language” differs from ours. In his definition, language is a brain function state, not an objective world phenomenon or human social phenomenon.
To better understand linguistics’ potential role in LLM development, I offer three points:
First, as Professor Mu Li mentioned during his lecture at Shanghai Jiao Tong University: LLM pre-training is now a technical issue, while post-training is an engineering challenge. Connectionism, represented by Hinton, has essentially solved pre-training issues after decades of development—we’re unlikely to see qualitative improvements in LLM performance at the pre-training level. What’s next? Knowledge engineering will rise again; rationalism and symbolism represented by linguistics will stand equal with connectionism represented by neural networks. The authority of “the bitter lesson” will be challenged.
Second, we’ve run out of pre-training corpora. Moreover, AI-generated text is massively contaminating human corpora. Foundation models show little progress beyond increasing efficiency and reducing computational complexity and resource demands; the internet mindset of “stack data, computing power, and parameters to improve performance” no longer works. We’re in the “post-training” era—as mentioned above, scaling needs to focus on model deliberation: incentivization, guidance, reinforcement learning, symbolism, self-play…
Third, is language truly thought itself? A Nature paper from MIT says no.
Evelina Fedorenko, Steven T. Piantadosi, and Edward A. F. Gibson. 2024. Language is primarily a tool for communication rather than thought. Nature, 630(8017):575–586.
Language expresses thought but isn’t thought itself. Ethically speaking, this suggests LLMs are merely parroting, not truly thinking. But I find this somewhat conservative—if we philosophically and ethically broaden our definition of “intelligence,” LLMs have already achieved basic intelligence and thinking at a results level. Some say LLMs and linguistics will diverge? Not necessarily—it depends on what linguistics we’re talking about.
Last week, Apple’s team published a paper arguing that LLMs are just more complex, fine-grained pattern recognition, not human-level reasoning.
Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. arXiv:2410.05229 [cs].
I think this is absolutely correct. But this raises a philosophical question reminiscent of the concept of “atoms”—dividing matter until it’s indivisible. If we break down human reasoning into pattern matching, subdividing until we reach indivisible units of reasoning, and perform pattern matching on these finest units until performance only improves with fundamental framework changes, haven’t we achieved human reasoning capability at a results level?
Though LLM researchers haven’t reached consensus on this point, I firmly believe that LLMs can’t reason—they perform linguistic sensemaking rather than reasoning. Language itself isn’t reasoning but its manifestation. Our task seems not to make LLMs “learn” reasoning mechanistically, but to provide constraints that make their sensemaking more closely approximate good reasoning, improving performance at the results level.
I’m not pre-deterministically stating linguistics’ definite role in LLMs. I’m using incomplete induction to demonstrate some possibilities for linguistics in LLMs. This only addresses core LLM issues, without even touching “applications”: hate speech, discrimination, bias, political science, computational humor, social simulation… There’s so much potential. Yet in China, linguists don’t see it, and LLM researchers don’t care.
Growing Together with Computational Linguistics
At this point, LLMs have helped me establish a profound identification with “computational linguistics.”
This semester in our technical translation course, we had an assignment translating a webpage about GPT’s principles. Coincidentally, this was the same professor from my first “bronchitis” interaction, who first introduced me to the possibilities of language intersecting with computation, law, medicine, finance, and other disciplines. Two years later, I casually mentioned to them how LLM terminology proliferates too rapidly for fixed translations, citing examples like Transformer, token, and Key-Value-Query. They noted this as a characteristic of young fields. Later, I realized: young, yes, but this also means unformed—still offering upward mobility and the power to define things.
For a rural child seeking social mobility, this is incredible news.
I firmly believe my chosen direction holds countless future possibilities. While social recognition, disciplinary construction, and evaluation standards may present challenges, it’s a vibrant future open to all disciplines, full of fascination and potential.
My most recent depressive cycle bottomed out this week. Last week, I maintained extraordinary efficiency—morning coffee, afternoon coffee, sleeping at 1-2 AM before rushing to 8 AM classes, spending entire afternoons and evenings in cafés running experiments and writing papers, making remarkably smooth progress. Unexpectedly, this week began with obvious mental numbness and neural fatigue. Coffee-borrowed energy always demands repayment. So I completely shut down for several days, sleeping thirteen to fourteen hours twice, finally finding time to organize this piece—originally meant to be several articles—which I’ve been accumulating since July or earlier, for my own entertainment.
Wednesday night, I video-called my father. I mentioned feeling so low that I didn’t even have a friend to drink with—my social life was truly tragic. I expected him to criticize my sentimentality, but surprisingly, he said, “Why did ancient emperors call themselves ‘the lonely one’?… Your position naturally brings solitude—humanities people can’t understand you, and you can’t fully integrate with science people.” Unexpectedly finding comfort from my typically stern father. The sun rose in the west.
Indeed, perhaps in a different environment, starting fresh, meeting more accomplished people, I might find more social happiness.
“Look at so-and-so, simply studying and returning to work from the bottom up, then getting promoted. Your pursuits are so lofty—changing majors, pursuing graduate and doctoral studies—of course it’s niche.” Thinking of all the peers my parents know, truly none are as “difficult” as I am.
I’m different from “most people.”
Yes, throughout this journey, it’s always been because “I’m different from most people.” This uniqueness is both curse and blessing.
What is computational linguistics? Nobody knows.
Who am I? Nobody cares.
But I know. I care.
Computational linguistics grows through chaos with vigorous vitality. So do I. Perhaps these fifteen thousand flowing words are just a story I’m telling myself. In this story, Odysseus overcomes challenges, receives guidance from masters, breaks boundaries, and merges water with fire—not for anyone else, but for himself.
The first year’s story is already so rich, making one wonder about the shape of the future.