Archive for the ‘language’ Category

Help computers sound more human!

2012/06/01

The annual Blizzard Challenge has been launched. This isn’t some crazed track-meet for mad-scientist climatology. It is a competition – a sort of annual standardized test – for speech synthesis systems.

I could gabble on about speech synthesis and the exciting progress that is being made in the underlying technology, to help computers sound more natural and human-like. But I don’t want to bore you. (If you want a post on that, leave a comment. If I get any interest, I’ll post something.)

I think I’ll just mention that, aside from having a less grating voice in automated phone systems and whatnot, high-quality synthesis may be a real help for people with severe disabilities. You are likely familiar with Stephen Hawking, rock-star physicist and mathematician. You probably also know that he is wheelchair-bound and uses a computer to talk, due to a degenerative neural disorder.

He and many others, old and young, rely on computers to give their thoughts a voice. The better we get at producing human-like speech with computers, the more naturally these people will be able to interact with each other and with those of us who take fluent speech for granted.

So, I invite you to help make speech synthesis better. Your role in the Blizzard Challenge is to rate the synthetic speech generated by the systems that have been entered in the contest. Just go here to start. (If you are a speech expert – such as a phonetician or speech technologist – then you need to use this link instead. It’s well-established that studying speech alters your perceptions in important ways. The task is the same, but the data will be analysed separately.)

Have fun!

 

Exploring language

2010/12/20

Thanks to Steve Novella of the Neurologica blog, I have discovered a new toy to play with: the Google Ngram Viewer. I’d like to share it with you, and encourage you to play as well.

You may have heard of all the books that Google has been digitizing for their Google Books project. (It caused some stir among publishers and writers – it seems to have been sorted out now.) Well, the Ngram Viewer is a very Google-esque* way of looking at word count statistics from that huge collection of books.

Let’s say you’re curious about the relative popularity of two words – say, “humanist” and “atheist“. Well, you enter them as search terms, and voila:

Relative frequencies of "humanist" and "atheist" in the Google Books corpus, from 1800 to 2000.

We can watch the relative frequencies of these words over time. Unexpectedly (to me at least), we see “humanist” (blue) overtake “atheist” (red) during the first half of the twentieth century, following a couple of decades (20s and 30s) tracking together. I’ll leave it up to readers to try to infer the reason for this inversion.

The term “n-gram” (yes, pronounced the same as “engram”, but there’s no connection to neuropsychology or Scientology) is used in corpus-based linguistics to denote sequences of words. A unigram is a sequence of 1 word; in the graph above, we compare the frequencies of two unigrams (relative to the total number of unigrams in the corpus). A bigram is a sequence of 2 words. Trigram: 3 words. From there on, it is common just to use the number: 4-gram, 5-gram, etc.
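If you like to see definitions as code, here is a minimal Python sketch of the idea. It assumes plain whitespace tokenization and a toy one-sentence corpus, which are my simplifications; it says nothing about how the Ngram Viewer itself tokenizes or normalizes its data. The names `ngrams` and `relative_frequency` are made up for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_frequency(words, phrase):
    """Share of all n-grams in `words` that match `phrase`.

    n is taken from the number of words in `phrase`, so a one-word
    phrase is compared against unigrams, a two-word phrase against
    bigrams, and so on.
    """
    target = tuple(phrase.lower().split())
    grams = ngrams(words, len(target))
    if not grams:
        return 0.0
    return Counter(grams)[target] / len(grams)


# Toy corpus; whitespace tokenization is an assumption made for brevity.
corpus = "the cat sat on the mat and the dog sat on the rug".lower().split()

unigram_share = relative_frequency(corpus, "the")     # "the" among all unigrams
bigram_share = relative_frequency(corpus, "sat on")   # "sat on" among all bigrams
print(unigram_share, bigram_share)
```

Note that the denominator changes with n: a unigram is measured against the total number of unigrams, a bigram against the total number of bigrams.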

One more unigram comparison that I thought was interesting: function words. Check out this graph comparing “the”, “and”, “of”, “for”, “a”:

Unigram frequencies of "the", "of", "and", "a", "for", from 1800 to 2000.

What is interesting here is that the relative (and even the absolute) frequencies show very little change over two centuries. Think about all of the change in the language that those two centuries represent – from shortly after the founding of America to around the time of the latest millennial fever. And these five words have shown such amazing constancy. Sure, there is some change, but compare it to the changes in other graphs, and the difference is clear.

So, let’s check out a bigram comparison. Here’s a chart of “national debt” and “social security”:

Bigram frequencies for "national debt" and "social security" from 1800 to 2000.

I’m no political scientist, but it looks like interest in social security leaped onto the scene in the late 30s, and has been slowly climbing ever since, while talk about national debt (in the English-speaking world) has steadily declined basically since the earliest samples in this corpus.

I could go on all day about this, but I’d rather leave it to you now. Before you take off to do your own informal surveys of this delicious data repository, let me offer a couple of caveats.

First, the numbers are only as reliable as the sources. What are the sources? Google gives some information on this. They note some sources of error; they also acknowledge some inherent biases. For example, there are more computer books in recent years than in the 1800s. Whether this is a problem or not depends on the sort of question you’re asking, and how you are interpreting it.

Second, there are different numbers of books in different time periods. They actually go back as far as 1500, but you get problems when, say, a particular year only has one book published. (Check out the results for that nice constant graph of function words, if you go back to 1500.)

Third, always always keep in mind what it is that you’re measuring. These graphs do not measure belief (search “bigfoot, ufo, unicorn“). They do not measure popularity or approval (search “murder, charity“, or the “national debt, social security” illustration above). They simply measure how often people mention the words (or bigrams, trigrams, etc) in published books. (Periodicals are excluded.)

Having said that, it is still a delightful way to while away a day. If you’re stuck for ideas, here are a couple of classic sources of interesting patterns:**

  • What are the relative frequencies of different number words? Is there anything systematic here? Any surprises?
  • What are the relative frequencies of gender-marked pronouns (“he, she”, for example)? How about gender-marked nouns (“man, woman”)?***

Have fun, my merry scientists!

Footnotes:

* Google-esque: powerful, easy to use, with the potential to distract me from real work with its endless possibilities to explore.

** Before doing any search, see if you can guess what the results will be. Form a hypothesis, give a reason for your expectation. If the results agree with your expectation, congratulations! If not, see if you can explain why. Does this new explanation generate predictions about some other word frequency pattern that you could now test?

*** There is at least one pair of gender-marked nouns that seems to reverse the general trend. Can you find them? Why would they be different?

Secular double entendre

2009/12/01

(Note to my religious readers: The following is not intended as an attack on religious belief, but I can foresee some sensitivities being nettled nevertheless. If you’d rather avoid being offended, feel free to stop reading now.)

I was just watching a video at the Friendly Atheist, promoting the Secular Student Alliance (SSA). It’s the American version of our National Federation of Atheist, Humanist and Secular Student Societies (AHS) – a nationwide organization aimed at building communities of secular students (atheists, agnostics, etc) at universities, colleges, and schools. Here’s the video:

Now, I know this will reveal my linguistic geekiness in its fullest degree, but the line that stuck out most to me was this:

[We believe] that science and reason lead to more reliable knowledge than faith.

Why, you ask? Syntacticians in the audience will already see where I’m going. There are, in fact, two high-probability, grammatical ways to parse this sentence in English.

The one that was intended could be paraphrased like so:

We believe that science and reason lead to more reliable knowledge than faith does.

Here’s the alternative reading:

We believe that science and reason lead to more reliable knowledge than to faith.

Okay, so the second reading doesn’t work quite so well. But both readings are consistent with the general outlook of atheists and humanists. We trust science and reason above faith* as paths to reliable knowledge, and we think that science and reason lead us to knowledge rather than leading us to faith.

Oh, and hooray for SSA and AHS – go check them out if you’re a student!

—–


* It is worth noting that this all uses the meaning of “faith” used by most humanists, which could most succinctly be expressed as “belief that does not rely on evidence”. Many religious people use different definitions. I think I may need to add another post to my series on definitions.

Language rant by proxy

2009/10/03

There is a rant that I used to share with any willing audience when I was an undergraduate student in Calgary, inspired by my burgeoning knowledge of how language works, and how different that is from the opinions spouted by language mavens.

I learned, through the brave confidence of a few close friends, that it was becoming a bit tedious to hear this rant over and over again – despite the inherent and unquestionable validity of its content, of course.

So I am delighted to point you to Gareth’s blog*, where he has essentially channelled my rant from past years and a continent away. (Though, I confess, I never did come up with as clever and apt an analogy as he does with the clothing thing.)

The basic thesis: issues of right and wrong in language use are, pretty much always, relative.

I don’t think it’s a stretch to suggest that most linguists would accept Gareth’s position as pretty obvious. But there are vast swathes of people (even intelligent people who think about language a lot) who think very differently. Let’s hope his lucid prose will sway some of them.

[Update 2012: Gareth's blog sadly no longer exists. I won't delete this post, but I'm afraid without Gareth's content it loses much of its point.]

Truth about dyslexia

2009/01/15

Gentle and thoughtful Cath is on the warpath – and justifiably so.

Some loon (Graham Stringer, MP for Blackley) is trying to deny the existence of dyslexia, a well-evidenced and widespread disorder, using false claims, fallacious reasoning, and other tactics familiar to rational people.

I cannot add anything to her very lucid demolition of his article. I’ll just give a couple of potentially useful links for further reading.

Here’s an international list of Dyslexia research and support organizations that a quick search brought up, in case anyone wants to read further: Dyslexia Parents Resource.

And if you’re from the Blackley area, check out Stringer’s contact info, voting record, etc. on this page.

Test your ear for language

2008/08/25

There’s a fun game for anyone who is curious about world languages. It’s the Language Quiz on Simon Ager’s Omniglot blog: he posts a short recording of a language, and you get to guess which language it is.

After several months of playing, I’ve only once guessed the right language. Well, this recent quiz is especially intriguing, so I thought I’d share it with you. Warning: the answer has already been given in the comments, so try listening first before you read them. (It’s tough, because the comments appear right below the text of the very short post.)

If, after listening, guessing, and checking your answer, you want to know why this is my favorite one so far, ask me in the comments and I’ll tell you. (I’d do it here in the post itself, but I don’t want to give away the answer before you try the quiz.)

Thiakian king on death, and my response.

2008/07/07

Today I give no argument, no news.

An ancient tale inspired a train of thought.
I’ll share the thought in pent-iambic verse,
the English form Fitzgerald used to scribe
the ancient epic Odyssey from Greek.

He renders Homer’s tale in vivid lines,
a saga of a man who seeks his home.

Odysseus speaks the lines that woke my muse;
recounts to his Phaiakian hosts his woes.
He’s sailed from Troy; he’s sacked an isle, and left.
Before he left a few of his men were slain.

Odysseus tells what happened then, at sea:

No ship made sail next day until some shipmate
had raised a cry, three times, for each poor ghost
unfleshed by the Kikonês on that field.

That word, unfleshed, is what has stirred my mind.
Belief and hope and fear in that term dwell.
Unfleshed: the self evicted from its corpse,
to travel down the dark Hadean paths.

A multitude expects such fate on death:
the unfleshed ghost, the soul, will carry on
to heaven, hell, or maybe back to Earth.

The word “unfleshed” befits these cherished thoughts,
expresses what so many hope from death.

But what of folk like me, who don’t expect
to live on past our physical demise?
What word have we expressing what befalls?

The snowflake melts: its shape, unique, is lost.
Just so the mind, which body must sustain,
when body fails, is gone, has ceased, that’s it.
How fleeting, fright’ning, this idea of self:
ephemeral and fragile. Here, then not.

The snowflake’s stuff, of course, will still remain,
will rise, form clouds, and then will fall again.
Just so, my starstuff matter carries on.
In plants, in rocks, in future human flesh
it feeds the life of Gaia, though I’m gone.

I do not know a cure for fear of death:
I dread the tolling bells that speak my end.
But facts are not beholden to my wish.
Instead, to truth’s stark beauty do I bend.

Does all this pose a word that I can use?
A word that speaks of loss and beauty cold?
Odysseus says the soul becomes unfleshed.
For me, the flesh, the life, becomes unsouled.

(Please let me know if verse-based blogging works.
These lines, did they enlighten or confuse?
Plain prose is still the medium I prefer.
Should ever I again invoke the Muse?)

Scientific thinking at breakfast

2008/01/17

I recently read the following on a pack of Kellogg’s Fruit ‘n Fibre:

FACT: Families who eat cereal for breakfast each day are less likely to be overweight than those who don’t. 

I get the impression from their website that Kellogg’s is more responsible than many companies in making appropriately backed-up claims, so the following is not meant as an attack on them.

However, the scientist in me couldn’t help digging into what exactly they’re claiming here.

The study that backs up their claim (I assume there was one) was probably designed around a chi-square analysis – they counted the families that fall into four groups – A, B, C, and D – as in the table below:

The “FACT” quoted above amounts to the following:

And a bit of simple algebra (multiply each side by B and divide by C) gives us the following (mathematically equivalent) expression:

This second expression could be read as follows:

FACT: Families that are overweight are less likely to eat cereal for breakfast each day than those who aren’t. 

So we have the same data, the same facts, but two subtly different statements. One suggests that eating cereal for breakfast helps prevent people from being overweight (perhaps even reverses it). The other suggests that being overweight leads people to avoid eating cereal for breakfast.
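The algebra is easiest to follow with the four groups written out. The labelling below is an assumption on my part, chosen only because it fits the steps described above (multiply by B, divide by C); treat it as a sketch, not as the layout of any actual study.

```latex
% Assumed 2x2 layout of family counts (the assignment of A-D to cells is an assumption)
\[
\begin{array}{l|cc}
                        & \text{overweight} & \text{not overweight} \\ \hline
\text{cereal every day} & A & B \\
\text{no daily cereal}  & C & D
\end{array}
\]

% First statement: cereal-eating families are less likely to be overweight
\[
\frac{A}{B} < \frac{C}{D}
\]

% Multiply both sides by B and divide by C (all counts assumed positive)
\[
\frac{A}{C} < \frac{B}{D}
\]
% Second statement: overweight families are less likely to eat cereal every day
```

Both inequalities reduce to the same condition, AD < BC, which is why neither statement says more than the other. The comparison also comes out the same whether you use odds (A/B) or proportions (A/(A+B)), since A/(A+B) < C/(C+D) rearranges to AD < BC as well.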

I could discuss the linguistic distinction between entailment and implicature; I could go on about the role of statistics and the need for precise language.

But remember, this was just a cereal box.

Mainly, this was simply a diverting exercise in taking something you might see every day, and using reason and some straightforward math to dig down into what’s really being said.

Humanism, and the empirical skepticism I practiced above, are not just things I bring out when confronted by outrageous pseudoscience or millennial zealots. They are part of how I see the world, day to day. They colour how I react to everything, from ads on cereal boxes to stubbing my toe to deciding what to get Deena for Valentine’s Day.

International Year of Languages

2008/01/08

I’m in the middle of analysing data, so I can’t talk long. Just wanted to mention that 2008 has been designated the International Year of Languages by the UN General Assembly.

If I were to talk about this, I hope I would say something like what This Humanist says.

I’ll also take this opportunity to explicitly list all of the linguistics-related blogs I now know of (let me know if I’m missing any):

Part of me is tempted to point out that linguistics is not immune to anti-science creationist foolishness. Another part of me is delighted that language origins are interesting enough that even pre-scientific and anti-scientific thinkers want in on the action.

And another part of me wants to use this link-heavy excuse for a lazy post to point you to more reliable sources of information on how languages actually change and diversify. It’s a fascinating process, in many ways analogous to species change (and in many ways not analogous). I wonder if demonstrating the observed, documented “speciation” of languages as a result of cumulative “micro-evolutionary” steps would help some of the more honestly deluded creationists accept the parallel phenomena in biological evolution?

Oh, well.

Here’s one last link for today – food for thought for those of us who are tempted to react viscerally instead of rationally when we encounter language change in our own community.

Watch your language.

2007/11/29

I’ve been following with interest and increasing horror recent developments on the Think Too Much blog. I’ve had a soft spot for it ever since the author declared himself a secular humanist, at least partly due to an earlier post of mine on this very blog.

Hugo’s recent post inviting “those that think they are atheists” to “drop all axioms that make you conclude ‘God does not exist’ ” crosses a line for me. It is a line that other apologists for religion occasionally cross too, when they can’t make their point another way. As a linguist, I regard language as a form of human social behaviour, and the line is crossed when people try to impose definitions or usages on language in direct opposition to the way language is actually used.

We have made “God” a label. We think “God is the creator of the universe”. By that definition, I understand why you call yourselves atheists. I did too. 

 

Yes, “God” is a label. Yes, it is created by “us”, if by “us” you mean the worldwide community of English speakers through history. Like all other words in all human languages, it is a label created by people trying to communicate ideas. Its meaning is derived from its usage – words mean what the community uses them to mean. And in this case, the vast majority of speakers of English, historically and currently, use the word “God” to mean the supernatural creator of the universe.

The Oxford English Dictionary (OED), drawing on over a thousand years of English literature, gives a multitude of related senses in which the word “god” is used (sometimes capitalized, sometimes not). The only senses that do not refer to a supernatural being are metaphorical uses that clearly depend on the supernatural meaning.

Hugo says,

God is meaning in life.
God is our morality.
God is compassion.
God is love.
God is inquisitiveness.
God is mystery, the mystery of the universe.
God is everything we cannot pen down with modernistic rationalistic terms and words.
God is our very irrationality.

This is poetic and beautiful, and I am willing to enjoy the poetry and beauty of it – as I enjoy the poetic use of God in Einstein’s “God does not play dice.” But just as Einstein’s quote becomes bad science if someone begins to take it out of its metaphorical context, so Hugo’s poetic passage becomes bad linguistics when he says

[The evangelicals] don’t know what God is. The dictionary? The dictionary does not know what God is. The important thing to note: God exists by definition.

No. The community, not the individual, is the arbiter of what “God” means. Language is a human behaviour, like a playground game that children play. Tag doesn’t change because one kid comes along and declares the rules to be changed. It only becomes a different game if everyone starts playing by different rules. Language works the same way.

Hugo tells us “You believe in love, compassion, inquisitiveness, communication, exploration? Please call that God.” No thanks. I have perfectly adequate words for those things. Words like “love”, “compassion”, “inquisitiveness”, “communication”, “exploration”.

And the thing that the rest of us mean when we say “god” still needs a label. Not just because people like Christopher Hitchens want to mock it (it’s hard to mock something you cannot name). But also because most religious believers in the world need a word to refer to the entity that they worship. God can still be thought of as mysterious and unknowable, but most worshippers still think of a conscious, supernatural (and often male) being when they use the word “God”. That’s where the meaning comes from. It’s the picture we share in our minds when we speak the word to each other.

I understand Hugo’s frustration. There are a lot of good things that have traditionally been bundled together with ideas of gods. It is natural for someone coming out of belief in the supernatural being to hope that he could keep the name “god” attached to the good things and jettison just the supernatural part of the definition.

But those things have had, and continue to have, definitions and labels of their own. What is distinctive about the label “god” is that it refers to a conscious supernatural being – in the West these days, it tends to refer to an exclusive, unitary, creator-of-the-universe conscious supernatural being.

Okay, now here’s the good news for Hugo. Meanings change. The word “queer” wasn’t about sex until the 1920s, according to the OED (nor was “gay” until the 1930s). Words change. As a speaker of Afrikaans and English, he has more direct experience of the long-term effects of that change than many of us. So it is possible that the English word “god” may come to lose its supernatural definition, and come to refer to all those things that Hugo wants it to mean.

It won’t happen because he declares it so, but he and others like him may be able to influence the community at large, to participate in some conscious language change.

Queerer things have happened.

