Posts Tagged ‘Acquia’

How Al Jazeera successfully managed through the turmoil

02 Mar

The following blog post was originally published as a guest blog post. I wrote it after Al Jazeera successfully moved some of their Drupal sites from their traditional hosting company to Acquia Hosting (now called Acquia Managed Cloud) to help them survive a 2,000% traffic increase resulting from the crises in the Middle East. The blog post provides real proof of how the cloud helped one of the largest news organizations in the world survive one of the largest political events in the world. A fascinating story for Drupal!

Over the past decade, the Web has completely transformed how people create and consume information. We have all witnessed firsthand how the free flow of information is impacting the way individuals and companies communicate and how the rules of governance are changing for entire nations. Now, we’re all participating and reporting on events as they happen, and from where they happen.

There is no better example of that than the most recent events in the Middle East. And one organization, Al Jazeera, the world’s largest news organization solely focused on the Middle East, was right in the middle of the incredible broadcast and social media storm that instantly developed. Throughout the ordeal, Al Jazeera effectively leveraged the power of the cloud to stay on the air and scale its reach and performance. If events of the past few months are any indication, there are lessons here for other content-driven companies to consider for their own online operations.

Al Jazeera’s English operations broadcast news and current affairs 24 hours a day, 7 days a week, with more than 1,000 staff members from more than 50 nations. Quite literally, Al Jazeera provides the world with a front seat on the Middle East stage. It broadcasts from centers in Doha, the capital of the state of Qatar, as well as Kuala Lumpur, London and Washington.

Al Jazeera’s live blog site is powered by Drupal, a free, open source social publishing platform that enables content-driven organizations to publish content and build communities quickly and easily. Drupal is used by many of the world’s most prominent organizations including the White House, the World Economic Forum, Intel, The Economist and Turner Broadcasting.

Al Jazeera’s English live blog site was a vital source for breaking news in Egypt. Bloggers were posting updates from the epicenter of the crisis, and social media was often the only means of communication both inside and outside of the country. During the crisis, traffic to the Al Jazeera web site increased 1,000% and traffic to the live blog spiked 2,000%. This surge, normally a welcome problem for a news organization, caused unpredictable performance and excessive page load times for site visitors.

From an infrastructure standpoint, Al Jazeera had historically hosted its blog with a traditional provider but had increasingly suffered a variety of scalability issues brought on by surging demand – unacceptable for Al Jazeera or any similar content business. What might have been just a typical technical nuisance on a mundane news day quickly became unsustainable when Egypt erupted.

Al Jazeera faced a mission-critical problem that needed a real-time solution. Where could it find high-performance hosting and support immediately and at a reasonable cost? Would it be secure and private? What about reliable? The answer: the cloud, the various data access, storage and hosting services available remotely over the Internet. Much discussed but often not fully appreciated by the business community, cloud services enable custom sites to perform well under varying, and sometimes severe, traffic conditions. Moving to a Drupal-supported cloud option allowed Al Jazeera to scale up quickly, render its content faster, and achieve a higher level of site reliability, overcoming the issues that had previously overwhelmed its physical hardware environment.

By leveraging Drupal and turning to the cloud, the Al Jazeera technical team demonstrated how to rapidly turn a seemingly disastrous situation into a net positive business decision going forward. Fast forward a few weeks, and the demands on Al Jazeera’s Web infrastructure have only increased with new crises across the region. The difference is the organization is now able to better handle these unforeseen demands and focus on the core business, reporting the news as it happens.


Building blocks of a scalable web crawler

24 Dec

I recently had the pleasure of serving as a thesis advisor on a work by Marc Seeger, who was completing a portion of his requirements for a Master of Science in Computer Science and Media at Stuttgart Media University. Marc's thesis was titled "Building blocks of a scalable web crawler".

Marc undertook a project for Acquia that I had originally started in 2006: a Drupal site crawler to catalog, as best as possible, the current distribution of Drupal sites across the web. That is a task for which there is no easy answer, as Drupal can be downloaded and used free (in all senses of the word). The best way to find out how many Drupal sites exist is to develop a crawler that crawls the entire web and counts the Drupal sites one by one.
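To give a flavor of what "counting Drupal sites one by one" involves, here is a minimal sketch, not Marc's actual implementation, of how a crawler might fingerprint a fetched page. Markers such as the X-Generator response header and references to stock Drupal assets are common tells, though real-world detection requires many more heuristics than shown here.

```python
def looks_like_drupal(headers: dict, body: str) -> bool:
    """Heuristically decide whether a fetched page was served by Drupal.

    `headers` is a dict of HTTP response headers; `body` is the HTML.
    This is an illustrative sketch, not the crawler's real logic.
    """
    # Recent Drupal versions advertise themselves in an X-Generator header.
    generator = headers.get("X-Generator", "")
    if "Drupal" in generator:
        return True
    # Sites also tend to reference stock Drupal assets in their markup.
    drupal_markers = ("misc/drupal.js", "sites/default/files")
    return any(marker in body for marker in drupal_markers)
```

A crawler would run a check like this against every domain it visits and tally the positives.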

With Marc's help, I was able to resurrect my crawler project. Marc spent 6 months working with me; 3 months were spent in Germany where Marc lives, and 3 months were spent in Boston where Acquia is based.

During that time, Marc explored suitable architectures for building out, collecting and managing website data on the order of many millions of domains. He examined different backend storage systems (Riak, Cassandra, MongoDB, Redis, CouchDB, Tokyo Cabinet, MySQL, Postgres, ...) and considered how to collect the data while simultaneously allowing search and access. As part of his work, Marc explored a variety of database technologies, schemas and configurations, and experimented with various configurations of Amazon's Elastic Compute Cloud (EC2) hardware. Issues common to any large deployment were investigated and analyzed in detail, including HTTP persistent connections, data locking and concurrency control, caching, and performant solutions for large-scale searches. HTTP redirects, DNS issues: his thesis covers it all, at least in terms of how each of these items impacted the search for an acceptable algorithm.
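The concurrency concerns above can be sketched in a few lines. The following is an assumed pattern, not code from the thesis: a bounded worker pool that processes many domains in parallel while capping the number of simultaneous connections, with the actual fetch logic injected as a callable.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_domains(domains, fetch, max_workers=10):
    """Apply `fetch` to every domain using at most `max_workers` threads.

    `fetch` is any callable taking a domain name and returning a result;
    in a real crawler it would issue the HTTP request, ideally reusing a
    persistent connection per worker to cut handshake overhead.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so we can zip results back
        # to their domains.
        for domain, outcome in zip(domains, pool.map(fetch, domains)):
            results[domain] = outcome
    return results
```

Bounding the pool size is what keeps a crawl polite and keeps file descriptors, DNS lookups and memory under control at the scale of millions of domains.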

The crawler has been up and running for a number of months now and has investigated about 100 million domain names. Now that those 100 million domain names have been crawled, I plan to start publishing the results.

Marc's work is available as a PDF from his blog post, and it's a good read, even if I'm slightly biased. Thanks for the great work, Marc! Time to look for a couple of new thesis projects, and thesis students who want to work with me for a few months. Ideas welcome!