New language auto-detection over Blogs

February 19, 2009

We are pleased to announce that improved language detection for blogs in the UGC Metabase will launch in two weeks. We’re also introducing new blog lists sorted by language, so you can see all the English, French, German, Chinese and other blogs in our index.

Finally, we’re adding a new date field showing the time at which we indexed a particular post. This is in addition to the publish date already provided, which is copied from the original XML/RSS feed.

 

1. Improved language detection at post level 

Blog feeds normally state which language they are in. However, this isn’t always reliable: blog publishing platforms typically have a default language setting, and bloggers do not always change it to the language they actually write in. As a result, a significant portion of blog feeds declare the wrong language.

We’ve been working hard in the background to produce a more reliable approach to language detection. We’ll be rolling this out next month as the basis for setting the post’s language, as provided in the <language> tag. Only when this approach cannot confidently determine the language will we fall back to the language tag provided in the original XML.
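The fallback rule described above can be sketched roughly as follows. This is only an illustration, not our production code: the function name, the numeric confidence score and the 0.9 threshold are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical cut-off: the real detector's confidence measure is not
# described in this post, so this number is purely illustrative.
CONFIDENCE_THRESHOLD = 0.9


def choose_post_language(
    detected: Optional[str],
    confidence: float,
    declared: Optional[str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Optional[str]:
    """Pick the value for a post's <language> tag.

    Use the detected language when the detector is confident enough;
    otherwise fall back to the language declared in the original XML.
    """
    if detected is not None and confidence >= threshold:
        return detected
    return declared


# A confident detection overrides a (possibly wrong) declared language.
print(choose_post_language("French", 0.97, "English"))  # French
# A weak detection falls back to the declared language.
print(choose_post_language("French", 0.40, "English"))  # English
```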

 

2. New language tagging at feed level

In addition, we are adding a new <feedLanguage> tag, showing the language of the blog feed. This complements the existing <language> tag referred to above, which is at post level.

Adding language categorisation at feed level makes it possible to better organise the index by language. For example, we can identify exactly which blogs are in French, which are in English, and so on, and provide and manage these as lists.

The new language tag will appear in the UGC XML as follows:

<feedLink>http://blog.moreover.com/feed/</feedLink> 
<feedLanguage>English</feedLanguage>
<generator>http://wordpress.org/?v=MU</generator>
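As a rough sketch of how a consumer might use the new tag to build per-language blog lists, the Python below groups feed links by <feedLanguage>. The <items>/<item> wrapper elements and the second (French) feed are hypothetical, invented only to make the example self-contained; the actual UGC XML layout may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# Sample document. Only the three inner tags come from the snippet above;
# the <items>/<item> wrappers and the example.com feed are assumptions.
SAMPLE_XML = """
<items>
  <item>
    <feedLink>http://blog.moreover.com/feed/</feedLink>
    <feedLanguage>English</feedLanguage>
    <generator>http://wordpress.org/?v=MU</generator>
  </item>
  <item>
    <feedLink>http://example.com/fr/feed/</feedLink>
    <feedLanguage>French</feedLanguage>
  </item>
</items>
"""


def group_feeds_by_language(xml_text: str) -> Dict[str, List[str]]:
    """Return a mapping of feed language to a sorted list of feed links."""
    root = ET.fromstring(xml_text.strip())
    groups = defaultdict(set)
    for item in root.iter("item"):
        link = item.findtext("feedLink")
        # Feeds without the tag are collected under "unknown".
        language = item.findtext("feedLanguage") or "unknown"
        if link:
            groups[language].add(link.strip())
    return {language: sorted(links) for language, links in groups.items()}


groups = group_feeds_by_language(SAMPLE_XML)
for language in sorted(groups):
    print(language, groups[language])
```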

 

3. Introducing a new Harvest Date field

Lastly, we’re adding a new <itemHarvestDate> field to the feed. This gives the time at which Moreover actually indexed the item. We already pass on the publish date of the post, as provided in the original XML/RSS feed. The new harvest date complements this and can provide, for example, additional information about indexing latency across the feeds.

The new harvest date tag will appear in the UGC XML as follows:

<pubDate>2009-02-11 14:26:06.0</pubDate>
<itemHarvestDate>2009-03-13 18:38:21.0</itemHarvestDate>
<validDate>2009-03-13 18:37:18.0</validDate>

All times are shown in GMT.
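To illustrate the latency use case, here is a minimal Python sketch that subtracts <pubDate> from <itemHarvestDate>, using the values from the sample above. The function names are made up for the example, and the timestamp format string is inferred from the sample values rather than taken from a formal specification.

```python
from datetime import datetime, timedelta

# Format inferred from the sample values, e.g. "2009-02-11 14:26:06.0".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ugc_timestamp(text: str) -> datetime:
    """Parse a UGC timestamp. All times are GMT, so naive datetimes suffice."""
    return datetime.strptime(text.strip(), TIMESTAMP_FORMAT)


def indexing_latency(pub_date: str, harvest_date: str) -> timedelta:
    """Return the time between publication and indexing of an item."""
    return parse_ugc_timestamp(harvest_date) - parse_ugc_timestamp(pub_date)


latency = indexing_latency("2009-02-11 14:26:06.0", "2009-03-13 18:38:21.0")
print(latency)  # 30 days, 4:12:15
```

Because both fields are in GMT, no time zone conversion is needed before subtracting.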

 

We believe in being open and transparent about our crawling performance, and are confident about our technology. We invite comparison with other, similar services (for example, see Technorati and a recent comment on ReadWriteWeb), and welcome any feedback you, as customers and users, have.

Entry filed under: aggregation services, aggregator, blogs aggregation, search engine products, social media.

