Showing posts with label statistics. Show all posts

Sunday, September 20, 2009

Statistics don't lie. People who use them do.

Mark Liberman makes his latest run at the zombie that won't die: the significance of the changes in the difference between male and female happiness. Douthat, Dowd, Levitt, Leonhardt and Huffington are not to be trusted on these numbers.

A general word of advice: if a newspaper article or opinion piece tries to use a study and its statistics to make an interesting point, don't believe it; investigate it.

Liberman does a nice job of accurately presenting the statistics in a manageable form.

I'll do my best to capture his point. This idea:


Is mostly BS.

But the best part of Liberman's post: the mouse-over text on the first graph.

Saturday, June 13, 2009

Crossing the threshold of hype

If you watched the video in the last post and you follow reputable language bloggers, you could probably guess that what caught my attention was Craig Crawford's acknowledgment of the Global Language Monitor's claim that English now has one million words.

But acknowledgment doesn't sound right to me. How about 'Crawford's duped acknowledgment...'? That's more like it.

I've written about Payack before, and so have the important language bloggers. The arguments haven't really changed, but they're worth repeating.

A relaxed definition of word would easily lead to several million words in the English language. At any point that you decide to limit the definition of word you've got an argument to make. Do we count rock and rocks as separate words? How about mouse and mice? How about the different tenses of verbs?

Once we get past such grammatical distinctions we have the hard part. Certainly teeter-totter and seesaw and hickey-horse should be counted as words distinct from each other, but what about potato bug referring to the Jerusalem cricket and potato bug referring to the woodlouse? Is potato bug1 distinct from potato bug2?

And then we have words like tubular, which used to mean resembling a tube; then during my childhood I learned it as cool, far out, groovy, outasight. One word or two? Does the second meaning even count as a word? How do we count slang?

How about ginormous? Fucktastic? Krunk? Bevemirage? The arguments about what is and what isn't a word immediately dissolve Mr Payack's claims that on June 10, 2009 at 5:22 GMT the millionth word entered the English language. The only way this determination is even theoretically defensible is if Payack and his algorithm were able to account for every slang word and every bit of jargon and every portmanteau and sandwich word and regionalism and simply say: when you count everything, without argument about what should be counted, there are X words known to and used by English speakers.

And that's only theoretically possible. And the count would be many times what Payack says it is. Especially if phrases like "wardrobe malfunction" are counted as words. How about other compositionally predictable items like "terrorist attack" or "computer program"? If they occur together enough, are they single words in addition to the individual words they comprise?

But he claims that his number is only an estimate and it's meant to celebrate the globalisation of English. We already know that English is global and we could have celebrated it a long time ago. And there's no reason to celebrate the threshold now just because he has marked the date.

According to a barely skeptical CNN.com story

[Payack's] computer models check a total of 5,000 Web sites, dictionaries, scholarly publications and news articles to see how frequently words are used, he said. A word must make 25,000 appearances to be deemed legitimate.


So it's a late celebration if we decide a word needs 10,000 appearances from 10,000 sources. And it's a very early celebration if we decide 30,000 appearances on 2,500 sources is necessary. And that is if we agree on a standard of word-form count.

Craig Crawford's home turf is CQ Politics, not Language Log or Visual Thesaurus. So we can't expect his bullshit sensor to be as well-tuned on issues of lexicography. But there is a tendency to believe a sparkly press release merely because it would be cool for it to be true. And the coverage of Payack's pronouncement has been more eager than investigative. The linguists are usually included as mere dissenters: stingy academics stifling the entrepreneurial spirit. There are exceptions.

A BBC4 segment pitted Payack against Ben Zimmer on level ground. With the opportunity to speak plainly in response, Zimmer shut down the claims pretty easily. When PRI's The World reran the story the silliness of such claims was pushed even further to the fore with David Crystal's reasonable voice adding some lovely and firm criticism.

The relevant segment takes up the first 10 minutes.



Even the host, Patrick Cox, speaks with a clearly dismissive tone, not just of Payack, but of the headline writers who were "the only people who seemed to like the story and the declaration."

Bravo Mr Cox. Bravo.

Saturday, January 12, 2008

Lies damn lies and sports

When I started following hockey and college football back in the Eighties I learned of the necessity for a difference between a winning streak and an unbeaten streak. (The same distinction is necessary in the NFL but ties are not common enough to make the difference salient.) For a while I thought it was just an optimistic and pessimistic way of saying the same thing. Then I learned that the semantics went beyond attitude.

So a winning streak does entail an unbeaten streak; and when a team ties a game the unbeaten streak continues but the winning streak is over. So an unbeaten streak is less impressive than a win streak. But it's not too shabby.

Because of the point system in hockey it was a relevant statistic. A team got two points for a win, one point for a tie, and no points for a loss. Well, that changed several years ago. Ever since the 1999-2000 season a team earns one point if the regulation period ends in a tie, even if the team loses the game during the overtime period. Of course the other team gets two points for the win. And now there are no more ties. The game will always be decided, if necessary by a shootout.

This has introduced a new phrase for another somewhat relevant streak: the point streak. Even if a team loses a game the streak is still intact as long as the regulation period ends in a tie. A winning streak entails an unbeaten streak which then entails a point streak. And of course a point streak is less impressive.
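The entailment is easy to check with a few lines of Python. This is only a sketch: the outcome labels ("W", "T", "OTL", "L") and the sample sequence are my own shorthand for illustration, not anything the league publishes.

```python
# Outcome labels are hypothetical shorthand:
# "W" = win, "T" = tie, "OTL" = overtime or shootout loss, "L" = regulation loss.
WIN = {"W"}
UNBEATEN = WIN | {"T"}        # a tie keeps an unbeaten streak alive
POINT = UNBEATEN | {"OTL"}    # an overtime loss still earns a point


def current_streak(results, allowed):
    """Count games, most recent first, until an outcome falls outside `allowed`."""
    count = 0
    for outcome in reversed(results):
        if outcome not in allowed:
            break
        count += 1
    return count


def streaks(results):
    """Return the three current streak lengths for a list of outcomes, oldest first."""
    return {
        "winning": current_streak(results, WIN),
        "unbeaten": current_streak(results, UNBEATEN),
        "point": current_streak(results, POINT),
    }


# Made-up sequence of results, oldest game first.
example = ["L", "W", "OTL", "T", "W", "W"]
print(streaks(example))  # {'winning': 2, 'unbeaten': 3, 'point': 5}
```

Because each set of allowed outcomes contains the one before it, the winning streak can never be longer than the unbeaten streak, which can never be longer than the point streak.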

And as far as headlines and sports commentary go there are several other streaks entailed by all these so far mentioned.

One AP headline today announces the end of one team's streak that isn't very significant and really not too impressive:

  • Canucks' 14-game home point streak ends


    Over the course of those 14 home games (which stretch back to Nov 9) the Canucks have also played 14 away games. Their record in those away games is 5 wins, 7 losses, and 2 overtime losses. Combined with their home record over that period (before today's loss) the Canucks have won 17 games and lost 11. Four of those losses were tied at the end of 60 minutes, so they get moved over into the points column.
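For anyone who wants to check the bookkeeping, here is a small sketch that works backwards from the figures above to the home record. The variable names are mine, and I'm assuming the 11 losses include the 4 overtime losses, which is how the numbers add up to 28 games.

```python
# Figures quoted in the post: away record and combined totals.
away_wins, away_reg_losses, away_ot_losses = 5, 7, 2
total_wins, total_losses, total_ot_losses = 17, 11, 4  # assumption: the 11 includes the 4 OT losses

# The home record is whatever is left after removing the away games.
home_wins = total_wins - away_wins
home_ot_losses = total_ot_losses - away_ot_losses
home_reg_losses = total_losses - (away_reg_losses + away_ot_losses) - home_ot_losses

games = total_wins + total_losses
points = 2 * total_wins + total_ot_losses  # 2 per win, 1 per overtime loss

home_games = home_wins + home_reg_losses + home_ot_losses
print(f"home: {home_wins}-{home_reg_losses}-{home_ot_losses} over {home_games} games")
print(f"overall: {total_wins}-{total_losses} in {games} games, {points} of {2 * games} possible points")
```

The home line comes out to 12 wins, no regulation losses, and 2 overtime losses, which is exactly what a 14-game home point streak requires.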

    There are several ways to make their performance sound impressive. One line late in the story extends the span of the accomplishment by mentioning that their previous home loss was back on November 1. That statistic gives the illusion of adding more than a week to their streak even though their next home game (the game that started the streak) was 8 days later.

    Another trick to make their performance sound more impressive: They're second in their division: the Northwest division of the Western Conference. But they're technically the 5th place team in the conference. And they're 18 points behind the conference-leading Detroit Red Wings.

    Sort the league into 2 conferences and it's possible for the best team in one conference to fall below the 50th percentile overall. Divide each of those categories into three divisions and it's possible for the best team in a division to be around the bottom 15th percentile.
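The worst case is simple arithmetic. A rough sketch, assuming a 30-team league split evenly into groups (the function name is made up for illustration): a group leader only has to beat its own group-mates, so every team outside the group could in principle rank above it.

```python
def worst_case_leader(total_teams, groups):
    """Worst possible overall standing for the best team in one of `groups` equal groups.

    Returns (overall rank, percent of the league ranked below that team).
    Assumes total_teams divides evenly into groups.
    """
    group_size = total_teams // groups
    teams_below = group_size - 1           # only its own group-mates must be worse
    rank = total_teams - teams_below
    percentile = 100 * teams_below / total_teams
    return rank, percentile


for label, groups in [("conference", 2), ("division", 6)]:
    rank, pct = worst_case_leader(30, groups)
    print(f"worst-case {label} leader: rank {rank} of 30, ahead of {pct:.1f}% of the league")
```

With two conferences the leader can sit as low as 16th of 30, ahead of about 46.7 percent of the league; with six divisions, as low as 26th, ahead of about 13.3 percent.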

    In sports such a case is unlikely. And this isn't such a case. The Canucks are not a bad team. And they haven't been playing badly over this span of time. They're comfortably over .500 in their record. But it's been no more than a mediocre run.

    17-11: and they get a headline calling it a streak.

    Sunday, September 30, 2007

    Don't breathe the water


    An AP story about a brain-eating amoeba is making the rounds. Why? Because there's been a recent spike in the number of cases reported. Infection is around ninety-seven percent fatal--only three survivors ever reported. It crawls up your nose and follows the olfactory nerve to your brain, causing headaches and hallucinations and finally death.

    Sounds pretty scary. And the spike this year was enormous. In 2007 Primary Amoebic Meningoencephalitis was reported to have caused about two-hundred percent more deaths than the yearly average from 1995 to 2004. The amoeba-- Naegleria fowleri --thrives in warm stagnant water. So global warming is expected to make it worse.

    I've borrowed some of the language and some of the statistical rhetoric from all the stories I've seen. Brain-eating is a favorite in these stories. It's up there with flesh-eating in its impact. Maybe better. It's a lot better than infection. It sounds a lot scarier and more deliberate on the part of those ravenous brain-eating mini-monsters.

    And the statistics: In the story I linked to above Chris Kahn writes "The spike in cases has health officials concerned." Infections are exceedingly rare. So a spike is a complicated phenomenon. From 1995 to 2004 there were 23 reported cases. That's just over 2 per year. 2007 saw 6 deaths. About four more than the average. But is that a spike? Snooping around the CDC website I found that there were 6 cases in 2002 as well. Two each in Texas, Arizona, and Florida.
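Here is the arithmetic spelled out, as a sketch rather than an official tally. It uses only the figures quoted above, counts 1995 to 2004 as ten years inclusive (my assumption), and treats the 23 reported cases and the 6 deaths as comparable, which is roughly fair given how rarely anyone survives.

```python
cases_1995_2004 = 23
years = 2004 - 1995 + 1          # assumption: both endpoints included, so 10 years
deaths_2007 = 6

average = cases_1995_2004 / years
excess = deaths_2007 - average               # extra deaths in absolute terms
ratio = deaths_2007 / average                # 2007 as a multiple of the average
percent_more = 100 * (ratio - 1)             # the same thing as a percentage increase

print(f"average per year: {average:.1f}")
print(f"2007 excess: {excess:.1f} deaths")
print(f"2007 vs average: {ratio:.1f}x, or {percent_more:.0f}% more")
```

The same handful of extra deaths can be reported as "about four more than average" or as a percentage increase in the hundreds. With counts this small, the percentage sounds far more dramatic than the absolute number.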

    I've not been able to find the numbers for year-to-year cases, tho I'm quite sure that they would make 2007's "spike" look less alarming.

    But then how would the story make it into all the newspapers and TV newscasts?


    Monday, January 29, 2007

    I still haven't seen Gilroy

    I like to look through my stats every once in a while (read: 6 times a day) to see who has been coming to my page. I use StatCounter to tally the visitors. It's a good meter. One of my favorite features is the Recent Visitor Map. But what I follow most are the keyword statistics. The counter provides a link to the pages that referred people to your page. When the visitor comes from a search engine the search terms are reported.

    The search terms that have led the most visitors to my page have been related to banshees and the pronunciation of babel. There was a short time when pluto/ed and Word of the Year brought in a gaggle of Googlers. Burr in reference to winter and cold has been a steady term for a few months now.

    Some searches are amusing. A few days ago someone came to my page asking "what's a good dare for a gf?" Use your imagination buddy. You'll come up with something. I have considered posting a little text box in the sidebar giving the most amusing search term of the week.

    Today's find is not a search term but a page. Apparently there's a Pig Latin Google out there. Click here to take a look.

    I notice on the search button that somebody named Peter probably designed the theme. I also notice that Google is never converted into Pig Latin. There are probably two reasons for this.

    1) The integrity of the trademark must be preserved. This is the prime directive.
    2) Something about Ooglegay might not be quite right for mainstream marketing.