Considerations, AWS Data Warehouse, Part I



This post lists considerations for developing an AWS Data Warehouse solution. Use it as a starting point for discussions with architects, project management and stakeholders. When starting a new data warehouse project, it is often unclear what should be a priority, so treat this list as a set of guidelines. Every DW project will have Project Management (at some level), Data Sets (source to destination), Resources (developers, TPMs, hardware, software) and more.

Now add the wrinkle of building the DW in the cloud on AWS with Redshift, RDS, EMR, Hadoop, data migration and other services, and it can become quite overwhelming and complex. Over the next few posts, I will expand the list and deep-dive into a few of the items, and hopefully give you and your project team some useful answers.

If you’re starting an AWS Data Warehouse project or have questions, please Email Us and we’ll get back to you ASAP.

Part I: High-Level Considerations for an AWS Data Warehouse Solution

Project Management 

  • Road Map – Features – Scrum (Story Board Items)
  • Project Time Line
  • Expectations of Stakeholders
  • Completion Criteria
  • DevOps – Long Term Support

Data Sets – Data Model

  • Source Systems
  • Data Model : Definition of Core Facts – Dimensions
  • Hierarchies, Drill-Down, Drill-through
  • Aggregations –  Default
  • Default Filters – User Defined Filters
  • Extended Aggregations – UDF, UDAF [Hadoop, Hive]
  • Predictive Analytics – Data Mining
  • Business Intelligence Tools – AWS Compatibility
  • Visualizations
  • Reporting
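
To make the fact, dimension and hierarchy items above a little more concrete, here is a minimal Python sketch of a default aggregation rolled up along a dimension hierarchy, then drilled down one level. The names and values (`date_dim`, `sales_facts`, `amount`, `rollup`) are made up for illustration and are not taken from any particular project or tool; in practice this would be SQL against your warehouse.

```python
from collections import defaultdict

# Hypothetical date dimension: surrogate key -> hierarchy attributes.
date_dim = {
    1: {"year": 2016, "quarter": "Q1", "month": "2016-01"},
    2: {"year": 2016, "quarter": "Q1", "month": "2016-02"},
    3: {"year": 2016, "quarter": "Q2", "month": "2016-04"},
}

# Hypothetical fact rows: each row references a dimension key and carries a measure.
sales_facts = [
    {"date_key": 1, "amount": 100.0},
    {"date_key": 2, "amount": 50.0},
    {"date_key": 3, "amount": 25.0},
    {"date_key": 1, "amount": 10.0},
]

def rollup(facts, dim, levels, measure="amount"):
    """Sum a measure grouped by one or more hierarchy levels of a dimension."""
    totals = defaultdict(float)
    for row in facts:
        attrs = dim[row["date_key"]]
        group = tuple(attrs[level] for level in levels)
        totals[group] += row[measure]
    return dict(totals)

# Default aggregation at the top of the hierarchy, then drill down one level.
by_year = rollup(sales_facts, date_dim, ["year"])
by_quarter = rollup(sales_facts, date_dim, ["year", "quarter"])
```

Agreeing on the hierarchy levels (year, quarter, month) and the default measure early makes it easier to discuss drill-down behavior with stakeholders before any BI tool is chosen.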

Resources – Architecture

  • Resources – Support
  • EC2 Machine Type  – Compute, Memory, GPU … Resources
  • S3 Size : Storage Requirements –  Expected Growth
  • AWS Data Services – AWS Data Pipeline –  AWS Kinesis : What, Where, When?
    • EMR – Hadoop, [ Spark ETL ], Hive, HBase, Pig, others…
    • DynamoDB – Scalable NoSQL store (fast, low maintenance)
    • Redshift – Data Warehouse [large scale]
    • RDS – Relational Database Service [ scale familiar databases ]
    • ElastiCache – In-Memory Cache Service vs. Redis (my favorite)
    • DMS – Database Migration Service
  • ETL : Data Dependencies – UpStream : DownStream, ETL Windows, Tools
  • VPC – Network Considerations
  • Performance
  • Archive
  • Data Quality : Identification, Fix, Resolution, Verification – Mechanisms
  • More to Follow in Part II…
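
The upstream and downstream ETL dependencies in the list above can be sketched as a small dependency graph. The example below is a hedged illustration in Python using the standard library `graphlib` module (Python 3.9+); the job names and the `run_order` helper are hypothetical, and a real project would more likely express this in its scheduler or pipeline tool.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL jobs mapped to the upstream jobs they depend on.
etl_dependencies = {
    "load_dim_customer": {"extract_crm"},
    "load_dim_date": set(),
    "load_fact_sales": {"extract_orders", "load_dim_customer", "load_dim_date"},
    "extract_crm": set(),
    "extract_orders": set(),
}

def run_order(dependencies):
    """Return one valid order in which every job runs after its upstream jobs."""
    return list(TopologicalSorter(dependencies).static_order())

order = run_order(etl_dependencies)
```

Writing the dependencies down this way also surfaces circular dependencies early, since `TopologicalSorter` raises `CycleError` when the graph contains a cycle.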

OK, That’s a Good Start for Part I

That’s a good start, but some parts are still missing, such as preliminary steps, gotchas, considerations for cloud and hybrid architectures, data integration and more. Also, if you are building an IoT EDW, your source IoT devices are going to be in the field, everywhere. It sounds obvious, but some enterprises are shocked when you talk about the deployment phase [ when you install or mount the IoT device, even if it’s mobile ] and have to think about “all” the things involved. It’s no small list of considerations, but it is manageable. We have a great write-up on IoT Device Consideration.

Please Email Us (Subject: IOT) and we’ll be happy to send you a copy of the write-up (PDF).

Cloud vs On-Premise vs Hybrid Architectures

Not everything belongs in the cloud. Products like Import/Export, Snowball and Direct Connect would not exist if cloud vendors did not see the need for hybrid. For me, there are a few systems I like to keep in house (on-premise) or in my own data center. There is a great debate about what belongs in the cloud and what belongs on-premise. I’m hoping that, as an enterprise, you’ve already decided you want your EDW in the cloud and recognize that your source systems are going to be both in the cloud and on-prem. I’ve been in many project meetings where the customer was still not convinced they wanted to be in the cloud. If you’re still unsure and have concerns…

Please Email Us (Subject: Cloud?) and we’ll be happy to schedule a call with you and one of our System Architects.

Final thoughts: there are many other NoSQL and database solutions that will run well in AWS, Azure and other cloud vendors, including Cassandra (one of my favorites), MongoDB, Splunk and many more. This series will focus on the design, build and implementation of an AWS-based Data Warehouse solution. We have years of experience with other technologies, so if you have a favorite, please let us know and we’ll post about those as well. Side note: do you want to see how big the NoSQL and SQL database list is today? Check out DB Engines List and Rank. Wow.

Thank you, – Gus Segura


Please Contact Us or subscribe to our blog if you found this interesting or would like more information.

Subscribe : Blueskymetrics Blog
