Pages from TechMan's notebook

Written by Ced Kurtz


Jottings from TechMan's notebook:

There's enough technology coming out of Carnegie Mellon University that TechMan could write every week just about that.

But then he could be accused of favoritism toward his alma mater, when in truth he can't remember the school song (if there is one) and can barely remember the school colors (burgundy and gray?).

But recently two CMU-related items popped up that are worth mentioning.

The first is a follow-up to a recent TechMan column about digitizing newspaper archives. Former PG colleague Byron Spice, now at CMU, alerted me to a development there that is aiding in the digitizing of books and newspapers.

You've probably seen a captcha if you have logged on to any secure Web site. It is the distorted letters or numbers you have to type back into the site in order to show you are a human and not a machine.

Pioneers in captchas were Luis von Ahn and Manuel Blum of CMU; in fact, they coined the term.

Mr. von Ahn and colleagues then went on to develop recaptchas, a way to put to use the decoding of these letters by millions of computer users.

One of the problems with digitizing text is that scanning machines cannot read some words because they are distorted in some way. Perhaps the printing is bad or the original is torn or creased. Those words have to be read by a human.

By using those words as captchas and allowing computer users to decode them as part of a Web page security process, millions of people have been enlisted in helping to digitize printed material without even knowing it.
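For the curious, here is a minimal sketch of how typed answers from many users might be combined into one trusted reading of a word. It is an illustration only, not the CMU team's code: the function name `accept_transcription` and the thresholds `min_votes` and `min_share` are assumptions made up for this example, and the column does not say how recaptcha actually decides when a word is settled.

```python
from collections import Counter


def accept_transcription(answers, min_votes=3, min_share=0.75):
    """Return the agreed-upon reading of a scanned word, or None if undecided.

    answers: strings typed by different people for the same word image.
    min_votes and min_share are illustrative thresholds (assumptions),
    not values taken from recaptcha itself.
    """
    # Normalize so that "Morning " and "morning" count as the same reading.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if len(cleaned) < min_votes:
        return None  # not enough people have typed this word yet
    word, count = Counter(cleaned).most_common(1)[0]
    # Accept only when a clear majority typed the same thing.
    return word if count / len(cleaned) >= min_share else None


print(accept_transcription(["morning", "Morning", "morning ", "mourning"]))  # morning
print(accept_transcription(["upon", "npon"]))  # None
```

The idea is simply that no single typist is trusted: a word is accepted only after several people independently agree on it.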

In the first year of recaptcha use, more than 440 million words were deciphered, the equivalent of 17,600 books. The system is currently being used to digitize the archives of the New York Times.
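A quick back-of-the-envelope check shows what book length those two figures imply. The variable names are just labels chosen for this example; the numbers are the ones quoted above.

```python
words_deciphered = 440_000_000  # words deciphered in the first year
books_equivalent = 17_600       # equivalent number of books

# Average book length implied by the two figures.
words_per_book = words_deciphered / books_equivalent
print(f"{words_per_book:,.0f} words per book")  # 25,000 words per book
```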

What a great hack.

The second CMU tidbit also has to do with making knowledge more easily available. CMU's Dan Hood is laboring to build an institutional repository for the university -- a Web site that would contain digital copies of research published by faculty members.

Faculty members routinely publish their research, but much of it appears in professional journals that have become so expensive that even some university libraries can ill afford to subscribe.

Although the researcher owns the copyright to any publication of his or her work, commercial journals that publish it sometimes impose contractual restrictions on making the work available elsewhere.

Mr. Hood is trying to give the faculty a way to easily publish research on the CMU repository and to work with their commercial publishers to allow the work to appear there at some point.

The beginnings of the repository are already online. There are some works up there now, and Mr. Hood is hoping it will grow.

The 1911 edition of the Encyclopedia Britannica is widely considered to be one of the greatest encyclopedias ever published. The edition, which marked the beginning of the transition from a British to an American endeavor, came out at a time when knowledge was exploding.

Many of its in-depth articles were written by legendary experts including Algernon Charles Swinburne, John Muir, T. H. Huxley, Ernest Rutherford and Bertrand Russell.

The 1911 has been digitized and made available on the Web at several spots.

Among three of them, one has ads, one has really annoying pop-up ads, and one has many empty pages waiting for someone to transcribe the articles. All three are incomplete versions of the original and suffer from the ills of unedited OCR transcription. But they make fascinating reading. And all three allow you to become involved by correcting scanner errors.

By the way, if you ever come across a print copy of the 1911 edition at a yard or estate sale, buy it.
