Data Sets, Data Model – AWS Data Warehouse, Part III-A
The following post is a list of considerations when developing an AWS Data Warehouse solution. Use it as a starting point for discussions with architects, project managers, and stakeholders. When starting a new data warehouse project, it is not always clear what should be a priority; use this list as a set of guidelines. Every DW project will have Project Management (at some level), Data Sets (source to destination), Resources (developers, TPMs, hardware, software), and more.
Now add the wrinkle of building the DW in the cloud on AWS with Redshift, RDS, EMR, Hadoop, Data Migration, and other services, and it can be quite overwhelming and complex. Over the next few posts, I will expand the list and deep-dive into a few of the items and, hopefully, give you and your project team some great answers.
If you're starting an AWS Data Warehouse project and/or have questions, please email us; we'll get back to you ASAP.
Part III, Data Sets – Data Model, AWS Data Warehouse solution – A.
Data Sets – Data Model
- Source Systems
- Data Model: Definition of Core Facts and Dimensions
- Hierarchies, Drill-Down, Drill-through
- Aggregations – Default
- Default Filters – User Defined Filters
- Extended Aggregations – UDF, UDAF [Hadoop, Hive]
- Predictive Analytics Tools – Data Mining
- Business Intelligence Tools – AWS Compatibility
Source Systems (upstream data)
This is typically your OLTP source system: Salesforce, PeopleSoft, inventory, shipping, IoT field devices; the list goes on. It is the upstream system from which you are going to push or pull data. Not all systems offer the same level of integration or interconnectivity. Some offer a fancy API that requires you to hire developers to move data from one system to another. Others have import and export utilities that provide a cryptic command line and the ability to move a whole bunch of data. Most are somewhere in the middle. Have no fear: most data scientists and data engineers make a living knowing how to move data, and we have several tools and techniques in our toolbox. Today there are also plenty of cloud tools that make this even easier. However, this functionality does have a cost; it's up to you and your stakeholders to determine the ROI.
Each source system should typically represent a Data Set: a set of data elements specific to that area. For example, from Salesforce.com you may pull CRM data [Salesforce ID, Customer Name, Address, etc.]; from inventory, [Stock on Hand, Turns, Logistics]; from IoT devices, [Speed, Temperature, Current]. These Data Sets need to be re-constituted (modeled) in the Data Warehouse in a way that is meaningful, scalable, supportable, and retains quality. [Ensuring data quality is a big part of maintaining both source and destination systems.]
Wait, what about Kafka, Storm, Spark Streaming, micro-batching, etc.? What about unstructured and structured data? What about dealing with extremely large scale? That is a big part of using those APIs and how timely you want your data. These are all separate topics; please follow and share my blog. Over the next several posts, I will do my best to share recent work in these areas.
Wikipedia defines a data model as "an abstract model that organizes elements of data and standardizes how they relate to one another and to properties of the real world." Your stakeholders may not understand the intricacies and functionality of a Data Model; that's your job. Somehow, some way, this model should be captured. It can exist in many forms and be built with many tools. Most enterprises will use ERDs (Entity Relationship Diagrams) and/or data diagrams [JSON] to represent the model. Be flexible in how you capture it: a metadata dictionary and/or other tools can be just as valuable as an ERD.
The good news is that most dev shops will have some type of diagramming tool, and if they don't, there are plenty out there that will build the model of an existing system for you (reverse engineering). The reverse-engineered diagrams are OK; the ones you build with your stakeholders are better. Also, the destination EDW model should be straightforward to develop, as most (even systems built on NoSQL) will represent a de-normalized structure with a fact table at the center and dimension tables around it, also called a star model, used for aggregation and drill-throughs. Note: the representation of this physical structure in NoSQL is a bit harder to determine and depends on the destination system.
That's it for Part III, section A. In the next section, I will cover aggregations and default filters. Again, there are some great tools available today, and it is getting easier for everyone to build great analytics.
If you have questions, please ask. You can email me and I will respond ASAP.
Please contact us or subscribe to our blog if you found this interesting or would like more information.