Monday, September 19, 2011

Notes from "Blending MongoDB and RDBMS for eCommerce"
OpenSky eCommerce platform

products in mongo using custom fields for different product types (i.e. actors in movie, tracklist for album etc)

- ordered items need to be fixed at time of purchase, table inheritance bad for this

## 3 things for e-commerce

- optimistic concurrency (update if current, then try again if document is not current)
- assumes environment with low data contention
- works well for Amazon with long tail product catalogue
- works bad for ebay, groupe, anything with flash-sales (high data contention)

# Commerce is ACID in real-life

purchasing something from a store deals with this without concurrency as each product can only be held by one customer

MongoDB e-Commerce

- each item (not sku) has it's own document
- contains
-- reference to sku
-- state
-- other meta-data (timestamp, order ref etc)

cart in card action difficult, but in Mongo changing state on item makes it unavailable to other customers (e.g. if state is 'in-cart')

## Blending

Doctrine (OS - ORM/ODM) modelled on Hibernate

in SQL
-- product inventory, sellable inventory, orders

- inventory is transient in Mongo collections
- inventory kept in sync with listeners

for financial transactions we want security and comfort of RDBMS

## Playing Nice

products are stored in a document
orders are entities stored in relational DB
store reference not relationships across two databases

Notes from "Schema design at scale"

# Eliot Horowitz

## Embedding sub-documents vs separate collection

i.e. blog post and comments

- embedded
-- Something like a million sub-documents is going to be unwieldy
-- need to move whole document (expensive) if gets large
- Not embedded
-- Million 'comments' in separate documents means lots of reads
- Hybrid
-- one document for core meta-data
-- separate document for n comments with array separated by buckets (i.e. 100 in each bucket)
-- reduces potential seeks (if 100 in each, reduces from say 500 to 5)

## Indexes

- Right-balanced index access on the B-Tree
-- only have to keep small portion in RAM
-- time based, object id, auto-increment
- Keep data sequential in index (covered index)
-- create an index with just the fields you need so you can retrieve the data straight from the index
-- index is bigger
-- good for reads like this

## Shard Key

- determines how data is partitioned
- hard to change
- most important performance decision
- broken into chunks by range
- want to distribute evenly but be right balanced, like month() + md5(something)
- if sharding for logs, need to think
-- why do you want to scale (for write or reading)?
-- 'want to see the last 1000 messages for my app across the system'
-- take advantage of parallelising commands across shares (index by machine, read by app-name)
- no right answer

## Lessons

Range query vs regex (that uses ^ - essentially 'starts-with') is about same performance
If you have a genuinely unique id, use that instead of the ObjectId

Notes from "Scaling MongoDB for Real-Time Analytics"

* Scaling MongoDB for Real-Time Analytics, a Shortcut around the mistakes I've made - Theo Hultberg Chief Architect, Burt

40gb, 50 million documents per day

use of mongo
- virtual memory (throwaway)
- short time storage (throwaway)
- full time storage

- sharding makes write scale
- secondary indexes

using jRuby

mongo 1.8.1
updates near 50% write lock incrementing single number
pushing near 80% write lock

mongo 2
updates near 15% write lock
push near 50% write lock

- 1
-- one document per session
-- update as new data comes along
-- 1000% write lock!
-- lesson: everything is about working around the global write lock
- 2
-- multiple documents using same id prefix
-- not as much lock, but still not great performance, couldn't remove data at same pace as inserting
-- lesson: everything is about working around the global write lock
-- being more thought about designing primary key (same prefix for a group)
- 3
-- wrote a new collection every hour
-- lots of complicated code, bugs
-- enables fragmented database files on disk
-- lesson: make sure you can remove old data
- 4
-- sharding
-- higher write performance
-- lots of problems (in 1.8) and ops time spent debugging
-- lesson: everything is about working around the global write lock, sharding is a way around this (4 shards means 4 global write lock)
-- lesson: sharding not a silver bullet (it's buggy, avoid it if you can)
-- lesson: it will fail, design for failure (infrastructure)
- 5
-- move things to separate clusters for different usage (high writes then drop; increment a document)
-- lesson: one database with one usage pattern per cluster
-- lesson: monitor everything (found replica that was 12 hours behind, useless!)
- 6
-- upgraded to monster servers (high memory quad extra large with AWS) (downgraded to extra large machine - 6 machines with mongod 12gb ram, 3 machines for mongo-config, 3 machines for arbiters across three availability zones writing 2000 documents per second, reading 4000 documents per secondc)
-- lesson: call the experts when you're out of ideas
- 7
-- partitioning again and pre-chunking (need to know your data and range of your keys)
-- partition by database, new db each day
-- decrease size of documents (less in RAM)
-- no more problems removing data
-- lesson: smaller objects means smaller documents
-- lesson: think about your primary key
-- lesson: everything is about working around the global write lock

- would you recommend?
-- best for high read load, high write load is not a solved problem yet
- EC2
-- you have replicas, why also have EBS?
-- use ephemeral disk, it comes included and is predictable performance (with caveats like spreading across availability zones to avoid data centre loss)
-- use RAID 10 if using EBS
- monitoring
-- mongostat
-- plugins that are available
-- mms from 10gen
-- server density (here at Mongo con)
- map reduce
-- moved away to be more real-time (but was using hadoop)