No, it’s not one of my projects! But I kinda wish it was.
The National Institute for Technology and Liberal Education (NITLE) has a fascinating project, the NITLE Blog Census:
Despite all the recent interest in blogging, few hard numbers are available about the extent of the phenomenon, particularly in languages other than English. The NITLE Blog Census is an attempt to create and share a regularly updated database of all known weblogs.
The census has been active since early May, 2003.
Our crawlers search the Web for weblogs, and attempt to categorize them by language and authoring tool. Data gathered during the census is archived every two weeks, and is available for non-commercial use. Our software respects the usual robots.txt exclusion rules. If you do not wish your weblog to be included in our surveys, please contact the site maintainer and we will expunge your site from our records.
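For anyone who hasn't seen the robots.txt convention in action, here is a minimal sketch of how a polite crawler might check it before fetching a page. This is not NITLE's code. The user-agent name `census-bot` and the sample rules are made up for illustration, and the check uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: block one named crawler entirely,
# and keep everyone else out of /private/ only.
EXAMPLE_ROBOTS = """\
User-agent: census-bot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("census-bot", "other-bot"):
        print(agent, is_allowed(EXAMPLE_ROBOTS, agent, "http://example.com/blog/"))
```

A weblog owner who publishes rules like the first block above would be skipped by a crawler that identifies itself under that name, assuming the crawler honors the file.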
The NITLE team has clearly given thought to the methodology: they are combining algorithmic methods of link-crawling with feeds from sources like Weblogs.com and various weblog directory lists.
And even better: it’s an open effort! You can download their data set in various forms, or use an XML-RPC API to do targeted queries.
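To give a feel for what a targeted XML-RPC query could look like, here is a rough sketch using Python's standard `xmlrpc.client` module. I haven't reproduced the census API here: the endpoint URL and the method name `census.getBlogsByLanguage` are placeholders I invented, so check the project's own documentation for the real ones.

```python
import xmlrpc.client

# Both values below are hypothetical placeholders, not the real census API.
HYPOTHETICAL_ENDPOINT = "http://example.org/census/xmlrpc"
HYPOTHETICAL_METHOD = "census.getBlogsByLanguage"


def build_query(method_name: str, *params) -> str:
    """Serialize an XML-RPC method call to the XML that would be POSTed."""
    return xmlrpc.client.dumps(params, methodname=method_name)


def query_census(endpoint: str, method_name: str, *params):
    """Send the call to a live server (needs a real endpoint to work)."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    # getattr handles dotted method names such as "census.getBlogsByLanguage".
    return getattr(proxy, method_name)(*params)


if __name__ == "__main__":
    # Show the request body for "up to 10 weblogs written in Portuguese".
    print(build_query(HYPOTHETICAL_METHOD, "pt", 10))
```

Running the script only prints the request XML. Calling `query_census` would actually go over the network, and only makes sense once you have the real endpoint and method names in hand.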
Spiffy stuff, indeed.