Monday, September 19, 2011

Notes from "Blending MongoDB and RDBMS for eCommerce"
OpenSky eCommerce platform

products in mongo using custom fields for different product types (e.g. actors in movie, tracklist for album etc)

- ordered items need to be fixed at time of purchase, table inheritance bad for this

## 3 things for e-commerce

- optimistic concurrency (update if current, then try again if document is not current)
- assumes environment with low data contention
- works well for Amazon with long tail product catalogue
- works badly for eBay, Groupon, anything with flash-sales (high data contention)
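
A minimal sketch of the "update if current, then try again" idea, assuming a hypothetical in-memory collection with a `version` field on each document (none of these names come from the talk):

```python
class InMemoryCollection:
    """Hypothetical stand-in for a document collection, keyed by _id."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # Every document starts at version 0.
        self._docs[doc["_id"]] = dict(doc, version=0)

    def find_one(self, _id):
        # Return a copy so callers cannot mutate the stored document.
        return dict(self._docs[_id])

    def update_if_current(self, _id, expected_version, changes):
        """Apply changes only if nobody else has updated the document."""
        doc = self._docs[_id]
        if doc["version"] != expected_version:
            return False  # stale read, caller should re-read and retry
        doc.update(changes)
        doc["version"] = expected_version + 1
        return True


def decrement_stock(collection, _id, max_attempts=5):
    """Read, modify, write back; retry when the document was not current."""
    for _ in range(max_attempts):
        doc = collection.find_one(_id)
        if doc["stock"] <= 0:
            return False  # sold out
        if collection.update_if_current(
            _id, doc["version"], {"stock": doc["stock"] - 1}
        ):
            return True
    raise RuntimeError("too much contention, giving up")
```

With low contention the first attempt nearly always succeeds; in a flash sale most attempts would hit the stale branch, which is the weakness the notes describe.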

# Commerce is ACID in real-life

purchasing something from a store deals with this without concurrency as each product can only be held by one customer

## MongoDB e-Commerce

- each item (not sku) has its own document
- contains
-- reference to sku
-- state
-- other meta-data (timestamp, order ref etc)

the 'item in cart' action is difficult, but in Mongo changing the state on an item makes it unavailable to other customers (e.g. if the state is 'in-cart')
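
A rough sketch of the one-document-per-item idea, using a plain Python list in place of a collection. The field names (`sku`, `state`, `order_ref`) follow the notes above; the `available` state name is an assumption. Against a real MongoDB the check and the state change would need to happen in one atomic operation (for example a find-and-modify), which this in-memory version does not attempt to show.

```python
def reserve_item(items, sku, order_ref):
    """Flip the first available item for a SKU to 'in-cart'.

    Returns the reserved item document, or None when every physical
    item for that SKU is already held by another customer.
    """
    for item in items:
        if item["sku"] == sku and item["state"] == "available":
            item["state"] = "in-cart"
            item["order_ref"] = order_ref
            return item
    return None
```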

## Blending

Doctrine (OS - ORM/ODM) modelled on Hibernate

in SQL
-- product inventory, sellable inventory, orders

- inventory is transient in Mongo collections
- inventory kept in sync with listeners

for financial transactions we want security and comfort of RDBMS

## Playing Nice

products are stored in a document
orders are entities stored in relational DB
store references, not relationships, across the two databases

Notes from "Schema design at scale"

# Eliot Horowitz

## Embedding sub-documents vs separate collection

e.g. blog post and comments

- embedded
-- Something like a million sub-documents is going to be unwieldy
-- need to move whole document (expensive) if gets large
- Not embedded
-- Million 'comments' in separate documents means lots of reads
- Hybrid
-- one document for core meta-data
-- separate document for n comments with array separated by buckets (i.e. 100 in each bucket)
-- reduces potential seeks (if 100 in each, reduces from say 500 to 5)
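
The hybrid layout might be sketched like this, assuming buckets are plain dictionaries with hypothetical `post_id`, `bucket` and `comments` fields and the bucket size of 100 mentioned above:

```python
BUCKET_SIZE = 100  # the example figure from the notes


def add_comment(buckets, post_id, comment, bucket_size=BUCKET_SIZE):
    """Append a comment to the newest bucket, opening a new one when full.

    `buckets` stands in for a separate comments collection; each entry
    holds up to `bucket_size` comments for one post.
    """
    post_buckets = [b for b in buckets if b["post_id"] == post_id]
    if not post_buckets or len(post_buckets[-1]["comments"]) >= bucket_size:
        bucket = {"post_id": post_id, "bucket": len(post_buckets), "comments": []}
        buckets.append(bucket)
    else:
        bucket = post_buckets[-1]
    bucket["comments"].append(comment)
    return bucket["bucket"]
```

Reading 500 comments then touches 5 bucket documents rather than 500 individual ones.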

## Indexes

- Right-balanced index access on the B-Tree
-- only have to keep small portion in RAM
-- time based, object id, auto-increment
- Keep data sequential in index (covered index)
-- create an index with just the fields you need so you can retrieve the data straight from the index
-- index is bigger
-- good for reads like this

## Shard Key

- determines how data is partitioned
- hard to change
- most important performance decision
- broken into chunks by range
- want to distribute evenly but be right balanced, like month() + md5(something)
- if sharding for logs, need to think
-- why do you want to scale (for write or reading)?
-- 'want to see the last 1000 messages for my app across the system'
-- take advantage of parallelising commands across shards (index by machine, read by app-name)
- no right answer
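
One way to read the `month() + md5(something)` suggestion, as a sketch only: a time prefix keeps new keys at the right-hand side of the index, and the hash spreads writes within that month. Using year plus month for the prefix and a `-` separator are my assumptions, not something from the talk.

```python
import hashlib
from datetime import datetime


def shard_key(created_at, something):
    """Build a key that sorts by month first, then by an evenly spread hash."""
    digest = hashlib.md5(something.encode("utf-8")).hexdigest()
    return "{:%Y%m}-{}".format(created_at, digest)
```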

## Lessons

Range query vs regex (that uses ^ - essentially 'starts-with') is about same performance
If you have a genuinely unique id, use that instead of the ObjectId
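
To illustrate why the two are comparable, a 'starts-with' match can be rewritten as a half-open range over the same sorted values. This is a sketch of the equivalence only, on plain Python strings; it ignores the edge case where the last character of the prefix is already the highest code point.

```python
import re


def prefix_range(prefix):
    """Return (low, high) such that low <= s < high iff s starts with prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Bump the last character by one to get the exclusive upper bound.
    high = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, high


names = ["ap", "app", "apple", "apply", "apq", "banana"]
low, high = prefix_range("app")
by_range = [n for n in names if low <= n < high]
by_regex = [n for n in names if re.match("^app", n)]
```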

Notes from "Scaling MongoDB for Real-Time Analytics"

* Scaling MongoDB for Real-Time Analytics, a Shortcut around the mistakes I've made - Theo Hultberg Chief Architect, Burt

40gb, 50 million documents per day

use of mongo
- virtual memory (throwaway)
- short time storage (throwaway)
- full time storage

- sharding makes write scale
- secondary indexes

using jRuby

mongo 1.8.1
updates near 50% write lock incrementing single number
pushing near 80% write lock

mongo 2
updates near 15% write lock
push near 50% write lock

- 1
-- one document per session
-- update as new data comes along
-- 1000% write lock!
-- lesson: everything is about working around the global write lock
- 2
-- multiple documents using same id prefix
-- not as much lock, but still not great performance, couldn't remove data at same pace as inserting
-- lesson: everything is about working around the global write lock
-- lesson: put more thought into designing the primary key (same prefix for a group)
- 3
-- wrote a new collection every hour
-- lots of complicated code, bugs
-- enables fragmented database files on disk
-- lesson: make sure you can remove old data
- 4
-- sharding
-- higher write performance
-- lots of problems (in 1.8) and ops time spent debugging
-- lesson: everything is about working around the global write lock, sharding is a way around this (4 shards means 4 global write locks)
-- lesson: sharding not a silver bullet (it's buggy, avoid it if you can)
-- lesson: it will fail, design for failure (infrastructure)
- 5
-- move things to separate clusters for different usage (high writes then drop; increment a document)
-- lesson: one database with one usage pattern per cluster
-- lesson: monitor everything (found replica that was 12 hours behind, useless!)
- 6
-- upgraded to monster servers (high memory quad extra large with AWS), then downgraded to extra large machines: 6 machines with mongod 12gb ram, 3 machines for mongo-config, 3 machines for arbiters across three availability zones, writing 2000 documents per second and reading 4000 documents per second
-- lesson: call the experts when you're out of ideas
- 7
-- partitioning again and pre-chunking (need to know your data and range of your keys)
-- partition by database, new db each day
-- decrease size of documents (less in RAM)
-- no more problems removing data
-- lesson: smaller objects means smaller documents
-- lesson: think about your primary key
-- lesson: everything is about working around the global write lock
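
The 'new db each day' partitioning could look roughly like the sketch below. The `analytics` prefix and the retention helper are hypothetical; the point is that old data is removed by dropping a whole database rather than deleting documents one by one.

```python
from datetime import date, timedelta


def db_name_for(day, prefix="analytics"):
    """Name of the database that holds one day's documents."""
    return "{}_{:%Y%m%d}".format(prefix, day)


def expired_dbs(existing, today, keep_days, prefix="analytics"):
    """List the daily databases that fall outside the retention window."""
    keep = {
        db_name_for(today - timedelta(days=n), prefix) for n in range(keep_days)
    }
    return sorted(
        name
        for name in existing
        if name.startswith(prefix + "_") and name not in keep
    )
```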

- would you recommend?
-- best for high read load, high write load is not a solved problem yet
- EC2
-- you have replicas, why also have EBS?
-- use ephemeral disk, it comes included and is predictable performance (with caveats like spreading across availability zones to avoid data centre loss)
-- use RAID 10 if using EBS
- monitoring
-- mongostat
-- plugins that are available
-- mms from 10gen
-- server density (here at Mongo con)
- map reduce
-- moved away to be more real-time (but was using hadoop)

Sunday, June 05, 2011

My first scala app

Well I say first, but that’s not quite the whole story. I’ve worked on existing Scala code bases but recently got the opportunity to start a new project and I got to see how the experiences on those projects and conversations with colleagues influenced design decisions. I wanted to note down some of the tricks I’d picked up.


We start with the tools and first up is SBT, version 0.7.x. We depend on the IDEA plugin to generate IDEA project files. We also use the Scalariform plugin, which formats code on every compile within SBT, and the Sources plugin, which when run downloads the source Jars without having to stipulate the ‘withSources’ directive on the dependencies.

The IDEA and Sources plugins are run as processors and are not checked into the SBT project file, whereas the Scalariform plugin is. The distinction is that a developer might choose to use a different IDE or not download the sources, but we should always try and conform to the same coding practices and, being lazy, having a plugin do it for you is one less thing to worry about.

Inversion of control / Dependency injection

One of the interesting developments early on was the choice of the dependency injection frameworks. Would it be Spring, Guice or something custom? Actually, we decided not to have any and just use plain old servlets through our routing framework, Scalatra. I think this has actually led us to write better code, for instance we have taken advantage of the companion objects for classes instead of relying on repository classes that would need to be injected.


We started with the routing framework, Scalatra which is similar to Sinatra and other simple routing frameworks. It provides a clean way of exposing your application to the web and combined with the following, enables all of the methods to be only a few very focused lines of code. We used Scalate and SSP for the template rendering which can feel a little verbose with the variable declarations at the top, but when you can pre-compile the templates you can understand why those are required.


All new projects are using MongoDB and we’ve been using the Scala driver, Casbah, for a while but this gave us an opportunity to try out Salat, the case class wrapper. This is probably as close as I’ve seen to ORM-like features in Scala while maintaining the MongoDB query sugar. As mentioned, using companion objects instead of repository classes allowed us to do tidy things like:

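case class Artist(_id: ObjectId = new ObjectId, name: String) {
  def save() = Artist.saveArtist(this)
}

object Artist extends SalatDAO[Artist, ObjectId](...mongo config...) {
  // update keyed on _id (the remaining arguments are cut off in the original notes)
  def saveArtist(artist: Artist) = update(Map("_id" -> artist._id), artist, ...)
}

And use them like:

// name is passed by name because _id comes first and has a default value
val artist = Artist(name = "My new favourite band")
artist.save()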

This meant we were using a repository like model, but hiding it behind the case class and the companion object. It makes for very neat code further upstream.


The app has to call out to other web APIs and for this we use dispatch which wraps the Apache HttpClient project with a bow making it really simple to get resources and convert them to string, XML or JSON. We were already familiar with lift-json so we would get the response as a string and use case classes to parse the JSON into. Once again, XML is almost a joy to work with in Scala compared to other languages, but lift-json still wins in simplicity so we use JSON formatted APIs where we can.

What’s next

I’ve got my eye on Lift and Akka. I’ve had a play with Akka, but I feel I’d like to understand it a little more before using it in a project. Other than that, I think I’m generally getting my head around Scala and functional programming all the more and I’m rather enjoying it. I still feel like I’m just scratching the surface, but it is my first app remember?

Sunday, April 10, 2011


I really wanted an excuse to play with NodeJS once again and after a conversation with blagdaross in the office I had my idea. I had also been looking for an excuse to stretch the legs of Joyent's hosting service for NodeJS. I had all of the pieces needed.

Once again I turned to the trusty toolset of ExpressJS for routing and EJS for template rendering. I wanted to build straight out of Twitter lists, so first thing was to hit Twitter to get the members of a list. As this is an open call, no oAuth handshaking has to take place, which is nice. However, getting the list of members returns a whole bunch of metadata about each member on the list. I didn't need any of this information, and I couldn't find all of the members without pagination. This is where I started to have some fun with writing asynchronous code for very much a synchronous web request.
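
The paging loop can be sketched independently of the HTTP details. This assumes a cursor-style API where each page carries a `users` list and a `next_cursor` that becomes 0 on the last page; `fetch_page` is a hypothetical callable standing in for the actual request, and the sketch is synchronous rather than callback-driven.

```python
def all_members(fetch_page):
    """Follow next_cursor until it is 0, keeping only the screen names."""
    members = []
    cursor = -1  # assumed convention for "first page"
    while cursor != 0:
        page = fetch_page(cursor)
        # Discard the rest of the per-member metadata.
        members.extend(user["screen_name"] for user in page["users"])
        cursor = page["next_cursor"]
    return members
```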

Once I had the list of members of a list, I could do a lookup on Cursebird for the points they have given those members. I have to say the Cursebird API is just fantastic! Talk about the API being the website, just stick .json on the end of any page you visit and get the data you need. No API key nonsense for open data is also a great move. Well done to Cursebird on their efforts. After sorting the list according to points, all it takes is rendering.

I recognised that crunching all of this data with potentially a few hundred server side calls to Twitter and Cursebird would be slow, so I introduced a really dumb key-value pair cache and use that cache at every level, from caching a single request to Twitter or Cursebird, to caching a whole dataset for a webpage.
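
A 'really dumb' cache of this kind can be as small as the sketch below (the app itself was NodeJS; this is not the original code). Keys are arbitrary strings, so the same cache can hold a single API response or a whole page's dataset. There is no expiry, which is part of what makes it dumb.

```python
class DumbCache:
    """Key-value cache with no expiry and no size limit."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key, compute):
        # Only call the (slow) compute function on a cache miss.
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```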

Onto hosting, and Joyent has this really interesting deployment technique where you do a git push into a repository you're given on the virtual server you have. From there a post-push hook is fired which deploys your app and runs a named file (server.js). The real time analytics are really impressive, although it'd be great to get historical information out of that too.

The one on-going grievance I'm dealing with is that locally I deploy on port 3000 but in production I deploy on port 80, and currently I'm changing this on commits as a workaround. I had a way of doing this through a node module I'd built a while ago to get properties from a json file, although up until now I'd been using nDistro to manage my dependencies, and with Joyent I'd have to do this a little differently. I could push the properties project into the NPM repositories but I'm also thinking of building a node module that can read nDistro files and add dependencies at run-time. It's food for thought.
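
One common alternative to editing the port on each commit is to read it from the environment and fall back to the local default. A sketch of that idea, where the `PORT` variable name is an assumption rather than something the hosting service is known to set:

```python
import os


def listen_port(env=os.environ, default=3000):
    """Use the PORT environment variable when set, else the local default."""
    return int(env.get("PORT", default))
```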

Finally a big thank you to JT for working up the styling.



Monday, February 14, 2011

Guardian SXSW Hack Review

What I built and why

I really wanted to scratch my own itch at the hack weekend with SXSW in mind. There are going to be over a thousand bands playing at the music festival, and many of those will be trying to break through and make it. This means there’s a fair chance I won’t have heard of many of the bands. Also, the sheer number of bands playing means it’ll be difficult for me to do any quick research about those bands to find out who I should go and see play.

I’ve used Last.FM for nearly four years and have approaching 20,000 scrobbled tracks in their dataset, so they have a good impression of the type of music and artists I like listening to and I wanted to tap into that data. Bearing in mind what I just said about not knowing any of the bands playing, I needed a different way to look at the data. Using Matt Andrew’s band listing API, I used the Last.FM API to find around 20 similar artists for each band playing at the festival, leaving me with a dataset of about 20,000 artists that are like those playing at the festival. My thinking here was that there would be a good chance that bigger bands would be amongst this dataset and I might have more of a chance of finding a match.

Now, using the top artists from the Last.FM API, I could do an intersection of the artists I like and artists similar to those playing at SXSW. I could then do a look up to see what bands are similar to bands I like, helping me discover new music at the festival without investing too much time doing any research on all of the artists. Yes, it gave me a bit of a headache too.
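
In outline, the matching step is a set intersection per festival band. A sketch with hypothetical data shapes (the hack itself was written in Scala against MongoDB, so this is not the original code):

```python
def recommend(similar_to_festival_band, my_top_artists):
    """Map each festival band to the artists I like that it resembles.

    `similar_to_festival_band` maps a festival band to its similar
    artists; `my_top_artists` is the list of artists I listen to.
    Bands with no overlap are left out.
    """
    liked = set(my_top_artists)
    recommendations = {}
    for band, similar in similar_to_festival_band.items():
        overlap = liked.intersection(similar)
        if overlap:
            recommendations[band] = sorted(overlap)
    return recommendations
```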

I only really got this working by lunch time on the second day, and although I had intended to build a simple HTML layer on the data I’d built, I just didn’t have the energy. I went for a coffee, a chat and a nice sit down instead. I don’t think it helped tremendously with the presentation not having any visuals, but I think a few people saw the potential of the application and the data behind it.

What’s next

The hack day is done, but I want to push this project on a little more. The code certainly needs a lot of love and I wasn’t using a complete dataset over the weekend, so I need to import that to get more comprehensive results. I’d also like to make this more generic for any festival, be it SXSW or Glastonbury.

Lessons, observations and stories

The hack weekend was only my second hack event, and the first one I went to was only for the first day of a two day event, so I suppose this was my first full hack event. I’ve been to plenty of small meet-ups, larger Bar Camp events and mammoth conferences, but a hack day definitely has a different and more intimate vibe to it. The number of attendees seemed pretty optimal and it was certainly a good mix of developers, designers and journalists although this only really shone through in the final presentations.

Although I had an idea of what I was intending to build at the event, and had even done some thinking about how the application might work (but no coding before the pistol on Saturday morning, as that’s cheating!), it still took me a while to get started. I probably would have coded my idea up in Node.JS as it’s quickly becoming my language of choice in the ‘get something done fast’ category, but as I was building something that I hoped would eventually be brought into the Guardian for festival coverage I thought I should build the application in what is fast becoming the language of choice at the Guardian, Scala.

Scala, although fantastically awesome on one hand, still runs on the JVM and still requires Java like setup with web.xml files and the like. This isn’t a problem in itself, but it means some non-trivial time is spent just setting up a project the way you like it.

This leads nicely into lesson one:
If intending to use a language that requires some investment in setup, look for a way of reducing this, perhaps by having a library of “hello world” applications pre-configured for repeat use.

Keeping with Scala, as a relative newbie to the language I know I was doing certain things in an inefficient way and that was compounded by the timescales of an event like this. When working on a project and you suffer a setback, either you don’t know how to do something or something you thought would work just doesn’t, you generally have time to investigate, ask around and try a few things out. However, that is a real luxury when trying to build things in hours and not weeks.

Lesson Two:
If you’re going to use a hack day to experiment with a new technology, expect frustrations and delays. Even a half hour delay can feel catastrophic.

Just to finish the Scala points, and this may be my lack of knowledge of the language, but I wish the JSON and HTTP support were a little better. Compared to the XML support, which is an excellent xpath like implementation, the JSON support felt clunky. I actually had to change the data I was getting (I was reading from a file, so I could do this) to remove some parts which I couldn’t get the code to parse. As for the HTTP support, I had to bring in the Jetty HTTP client (which didn’t seem to recognise ‘utf-8’ as a character encoding), then bring in the Apache Commons HTTP client to request data from the Last.FM API. One post on Stack Overflow I was reading while I was looking for answers to a Scala problem suggested having a personal library for wrapping functions you wish were supported better.

Lesson Three:
Knowing how to do common things really well, and fast is essential. In my case, using web based APIs using JSON and rendering JSON in turn.

One technology I did use which I’m now using on a day to day basis is MongoDB and this is where knowing something about a technology really came into its own. Getting stuff into and out of the database was as easy as it should be. I used the effective but perhaps slightly noisy Casbah Scala driver to talk to MongoDB. I was also using the Last.FM API to get information about bands and I realised that the Last.FM API was probably one of the first public web APIs I used thanks to a Paul Downey workshop when I was a fresh faced graduate. In a good way the API doesn’t look to have changed, which should be great as anything I built in that workshop might have a chance of working today. However, the web has moved on a bit from RPC and XML and I’d like to see Last.FM offering more of a RESTful JSON based API. That might just have influenced my decision to use Node.JS instead of Scala due to the support of JSON over XML.

As this was a music project, the sensible thing was using Music Brainz ids for the bands, and although this worked in many instances some of the bands playing SXSW don’t yet have Music Brainz ids and perhaps more surprisingly the Last.FM API doesn’t seem to provide Music Brainz ids for all of the bands in its API; even top 100 chart bands can be without one. The algorithm I built depended on this data, and although I could do a best effort and call Music Brainz directly, it would have been nice if this was covered off by the larger provider, Last.FM.

Update: I've already found out Last.FM does actually offer JSON in its API, which would have been handy to know, and sorta proves the lesson of knowing what you're doing before starting.

Monday, December 20, 2010

Cache me if you can

From the producers who brought you 'Tengo and Cache', we present 'Cache me if you can'.

The concept is all about HTML delivery into iOS devices alongside a mechanism for updating the application without going through the Apple app store. Loading HTML from a local file is not a new idea, as a few companies and projects have sprung up around this type of mobile app development, most notably Phonegap and more recently Apparatio.

The original iOS app I had built, Tengo and Cache, took this idea in a slightly different direction. The projects above deliver HTML within an application and would rely on Javascript to update any content within the application, using local storage to persist data. What Tengo and Cache offered was a way to download new HTML documents in their entirety, storing the files in writable areas of the file system on a device. This worked by providing a manifest file, based on the HTML5 manifest, on the domain which the iOS app was trying to download files from.

During a hack day for the Guardian, I extended this work to add a further downstream cache. Although every effort is made to cache resources before loading, some files may be requested which are not included in the manifest. Overriding the NSURLCache class, the application can intercept any calls which attempt to go out to the web. This cache then retrieves the file, stores it, and then serves this instead of continuing down the pipeline. The next time this file is requested, the cache serves the file on disk instead of hitting the web at all.
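
The intercepting behaviour is a read-through cache. The sketch below shows the control flow only, in Python rather than against the iOS `NSURLCache` API; `fetch` is a hypothetical function that performs the real web request, and the hashed file name is my own choice for a cache key.

```python
import hashlib
import os


def cached_fetch(url, cache_dir, fetch):
    """Serve a URL from disk when present, otherwise fetch and store it."""
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        # Cache hit: never touch the web.
        with open(path, "rb") as cached:
            return cached.read()
    body = fetch(url)
    with open(path, "wb") as out:
        out.write(body)
    return body
```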

Now that I had an iOS application that should pre-cache and intercept, I needed some content to install onto a device. I had wanted to use a Wordpress blog, or RSS feed, but felt that this content could prove difficult. Without knowing what content would be delivered, I couldn't build a manifest file that would capture the entire content. Also, I thought it might prove to be bad user experience if something that would work on a web page when online might not work offline. Even something as ubiquitous as search might look to have failed miserably.

I decided to use something I knew I had control over, the Guardian Content API. Using a small NodeJS application, I could retrieve a query and present that as HTML, alongside an appropriate manifest file. Then all that was required was a small property change in the iOS app and a native application was ready to be launched.

Tuesday, November 23, 2010

Samsung Galaxy Tab - Review

The first thing you notice when you start using a Samsung Galaxy Tablet is that it clearly thinks it’s a phone. Most people I show the device to also think it’s a phone as they do their best Dom Joly impersonation. I had only intended to use the device over wifi but I’m constantly reminded that I haven’t put a SIM card in, and that the phone can only make emergency calls. I noticed at one point that a significant portion of battery life was being put to use on phone related activities. At that point, I put the device into flight mode and then enabled wifi to try and make the battery last longer. Trying not to use the device as a phone means many of the Samsung applications simply don’t work as they require a SIM card, although why they need the SIM card I couldn’t tell you.

One of the things I was looking forward to doing with the device when I got it was to use it around the office as I’m using a desktop and I wanted a way to take notes, look things up and do demos. My first snag there was that the electronic keyboard is just too small for any note taking at length and when I held the device in portrait mode I was typing by thumbs alone. I’ve found that the screen on the Galaxy is just about right for web pages, and due to the size, I can hold the device in one hand, not unlike the Kindle. Many websites are redirecting me to their mobile version, which can look very nice on a device of this size, but it sort of defeats the purpose of having a tablet as opposed to a phone.

Some of the plain oddities are exposed when trying to use Google Docs on the phone. Having sent you to the mobile version, you can edit documents and spreadsheets, but not presentations. When trying to change to desktop mode you get a bizarre error about the browser not supporting web word processing. There was me thinking that all that was needed was HTML rendering, Javascript processing and an Internet connection.

I have felt I’m getting benefit when you consider just about how ‘cloud-enabled’ the device is. I’m yet to plug the device in to my computer, apart from trying to deploy an application I’d built onto it. For music I’ve been using Spotify, Last.FM or iPlayer for Radio 6. Tools like Google Voice have let me find, download and play podcasts without needing desktop software like iTunes with varying success. The speakers on the device are just about good enough to carry around the house with you, or plug the headphone jack into stereos in different rooms. I’ve used DropBox to drop video files onto the device and they just play without having to configure anything or install any codecs.

I’m generally enjoying being able to think of this as more of a computer than a locked down device though, by seeing the running applications and being able to navigate the file system. However there seem to be simple things I just can’t find, like changing the auto-lock timeout or being able to wake the device from being locked other than pushing a button at the top of the device, when my hands tend to be at the other end of the device.

I’m happy that Apple has some competition in the tablet market, but I think the experience needs to improve a lot, through both the physical device and the software that powers it. I’d suggest a good start would be to get the device to stop thinking it’s a phone. The Amazon Kindle has a SIM card for 3G access, but it knows it isn’t a phone. It may be a small point, but I prefer my devices without the split personality.

Saturday, September 11, 2010

nodejs, ndistro and git submodules

On the (very early) tapas platform at theteam I got a little stuck when mixing nDistro and git submodules and wanted to explain what I'd done to get around those issues. On the tapas-models module, MongoDB is used for data storage and tapas-models uses Mongoose for MongoDB integration, but nDistro doesn't download Mongoose's downstream dependencies.

This is because nDistro downloads the tarball of the project from Github and that tarball doesn't include dependencies. It might be nice for Github to do this, and I'll search the Github Support site to see if I can find something about that. Anyhoo, the tarball doesn't contain its git bindings, so I can't go in there and update the git submodules.

The unfortunate thing is I've had to expose myself to some of the Mongoose inner workings to get the dependencies, but once I know them, it's a light touch for the next part. Using nDistro as normal, I include that dependency, including the revision number in my .ndistro file. As nDistro is executed in a bash environment, shell scripting can be used alongside the module declarations. So, I use a Linux move command to put the dependency where Mongoose expects it and everyone's happy.

It's a simple solution to what was a nagging issue and keeps me on my happy path toward NNNN