MongoDB Production Features


A user perspective: Blueskymetrics BSMSentimizer™ – Gus Segura

We run more than one Big Data technology to support our production environment, and MongoDB has been there with us from the start. It just works! Our core use case is sentiment text analytics: product and service reviews, complaints and praise from customers. Our systems are both real-time and batch. We have tried other technologies and products, and we will continue to evaluate new features. For us, and for now, MongoDB is our JSON document store.

We have different sources, data types, data keys and more. We need flexibility in our document model, sharding strategy, replication, storage, access and aggregation strategy. The following are considerations for implementing MongoDB, a document store, in a production environment, and the reasons we chose it.

MongoDB Overall Architecture

Replication is driven by the oplog. The operations in the oplog are idempotent, meaning they always result in the same change to the database no matter how many times they are applied. Sharding (process name: mongos) gives MongoDB built-in horizontal scalability, implemented as a "shared nothing" architecture. Shared nothing is an important feature for us because it lets us size and resize the cluster (and optimize spend) according to the amount of data we are aggregating.
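If you want to see the oplog for yourself, you can read it from any replica-set member. The sketch below uses pymongo with hypothetical host names; it is only an illustration, not part of our pipeline.

```python
from pymongo import MongoClient, DESCENDING

# Hypothetical replica-set member; the oplog lives in the "local" database.
client = MongoClient("mongodb://db1.example.com:27017/?replicaSet=rs0")
oplog = client.local["oplog.rs"]

# Show the most recent operations that secondaries will (idempotently) re-apply.
for op in oplog.find().sort("$natural", DESCENDING).limit(3):
    print(op["ts"], op["op"], op.get("ns"))
```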

Clustering is a welcome way to scale. The MongoDB architecture runs on-premise or at the cloud vendor of your choice (adjust your Packer and Chef scripts accordingly). We can tweak our replication and sharding strategy at will. The flexibility of this platform architecture is a welcome change from the rigid environments of lore.

Replication vs Backup … We use both

A moment to explain some interesting technology that MongoDB supports. Replication: a replica set means you have multiple instances of MongoDB mirroring each other's data. A replica set consists of one master (aka "primary") and one or more slaves (aka "secondaries").
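As a minimal connection sketch, assuming a three-member replica set named rs0 (the host names are placeholders), pymongo discovers the primary for you:

```python
from pymongo import MongoClient

# Listing several members lets the driver discover the current primary automatically.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017"
    "/?replicaSet=rs0"
)

# Requires a user with the clusterMonitor role; prints each member's state.
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])
```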

"Read" replication: read operations can be served by any slave, which lets you spread the load from the client. If your application is read intensive, this can be a great solution. It also removes a single point of failure as part of a backup strategy. We also pipeline JSON data (mongoexport) to our Spark/Hadoop cluster (Spark Streaming, HDFS) to enable near real-time analytics.
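To spread reads across the slaves, the driver's read preference does the routing for you. A sketch with pymongo (the database and collection names are made up for illustration):

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://db1.example.com:27017/?replicaSet=rs0")

# secondaryPreferred: serve reads from a slave when one is available,
# fall back to the master otherwise.
reviews = client.get_database(
    "sentiment", read_preference=ReadPreference.SECONDARY_PREFERRED
).reviews

for doc in reviews.find({"product": "widget"}).limit(5):
    print(doc["customer_id"], doc["text"])
```

Keep in mind that reads from slaves can lag the master slightly; that is fine for analytics but matters if you need to read your own writes.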

Mongos (service) – Sharding

A sharded cluster means that each shard of the cluster (which can be, and usually is, a replica set) takes care of a part of the data. Each request, both reads and writes, is served by the shard where the data resides. This means that both read and write performance can be increased by adding more shards to a cluster.

Which shard a document (stored record) resides on is determined by the shard key (key space) of each collection. The shard key should be chosen so that the data is evenly distributed across all shards. If one shard of the cluster goes down, any data on that shard is NOT available; for that reason each member of the sharded cluster should also be a replica set. The mongos (service) directs write operations from applications to the shards that are responsible for the specific portion of the data set (key space). [StackOverflow]
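Setting up a sharded collection is done through the mongos router. A sketch, assuming a database called sentiment and a hashed shard key on customer_id (both are placeholders, not our real schema):

```python
from pymongo import MongoClient

# Connect to the mongos router, not to an individual shard.
client = MongoClient("mongodb://mongos.example.com:27017")

# Enable sharding for the database, then shard the collection.
client.admin.command("enableSharding", "sentiment")

# A hashed key spreads writes evenly across shards,
# at the cost of efficient range queries on that key.
client.admin.command(
    "shardCollection", "sentiment.reviews", key={"customer_id": "hashed"}
)
```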

Write operations (inserts and updates) always take place on the master of the replica set and are then propagated to the slaves. Write-intensive workloads require some configuration and optimization techniques.
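One of those knobs is the write concern: how many replica-set members must acknowledge a write before the driver returns. A sketch (the names are placeholders):

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://db1.example.com:27017/?replicaSet=rs0")

# w="majority": wait until a majority of members have the write;
# wtimeout caps how long we block before raising an error.
reviews = client.sentiment.get_collection(
    "reviews", write_concern=WriteConcern(w="majority", wtimeout=5000)
)
reviews.insert_one({"customer_id": 42, "text": "Fast shipping, great service."})
```

Lower write concerns (w=1) are faster but risk losing acknowledged writes on failover, so tune it per collection.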

Putting it all together for Production

I'd re-read the previous paragraphs a time or two. A shard is almost always replicated at least 3x. This implies the cluster (hardware, etc.) has to be 3x larger than a single shard without replication. This can be fascinating and frustrating at the same time. New implementers normally feel the pain when they move from development to production, even if they're running DevOps (perhaps you're not running a true blue-green environment).

We jumped on the Hadoop bandwagon early on. We were going to use Hadoop and Hive for everything. We did a cost-benefit analysis for our use case and quickly migrated to MongoDB. In our pipeline, we do a lot of data cleansing and machine learning (deep learning) in process. We dynamically dump [archive to our dark data lake 🙂] data that may have less value than our current analytics need. MongoDB, Python, Spark, Scala, Kafka, etc. enable us. Notice I said Mongo first. We implement text search, the aggregation pipeline, the dynamic document model and other key features of Mongo to help us identify the data of value.
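To give a flavor of what that looks like, here is a sketch that combines a text index with the aggregation pipeline; the field and collection names are invented for illustration, not our production schema.

```python
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://db1.example.com:27017")
reviews = client.sentiment.reviews

# $text queries require a text index on the field(s) being searched.
reviews.create_index([("text", TEXT)])

# Average sentiment score per product for reviews mentioning "refund".
pipeline = [
    {"$match": {"$text": {"$search": "refund"}}},   # $text must be the first stage
    {"$group": {"_id": "$product_id",
                "avg_score": {"$avg": "$score"},
                "reviews": {"$sum": 1}}},
    {"$sort": {"reviews": -1}},
]
for row in reviews.aggregate(pipeline):
    print(row)
```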

We still use Hadoop, Hive and Spark and love them! However, our cloud spend (for Hadoop) is optimized for aggregation, data prep and movement to the final store. This allows us to implement stateless Hadoop clusters on-premise and at our cloud vendor of choice.

Parting Thoughts …

Our data lake architecture and data pipeline allow us to rewind, redo and undo (stateless in the middle). We aggregate a lot of data in process. The final store is performant and is a fraction of the source data after all the in-process data valuation, cleansing and aggregation. MongoDB enabled us to cost-optimize our environment while retaining a technically advanced initial storage platform. Your mileage may vary.

If you're starting a new project or need help with your existing data solution, please contact us.
Email: support@blueskymetrics.com – Phone: 765.325.8373 ( call / text ).
