AWS DW, Aggregations, Default Filters – Part III-A



Aggregations, Filters – AWS Data Warehouse, Part III-A

The following post is a list of considerations when developing an AWS Data Warehouse solution. Use it as a starting point for discussions with architects, project management, and stakeholders. When starting a new data warehouse project, it is often unclear what should be a priority; use this list as a set of guidelines. Every DW project will have Project Management (at some level), Data Sets (source and destination), Resources (developers, TPMs, hardware, software), and more.

Now add the wrinkle of building the DW in the cloud on AWS with Redshift, RDS, EMR, Hadoop, Data Migration, and other services, and it can be quite overwhelming and complex. Over the next few posts, I will expand the list and deep-dive into a few of the items, hopefully giving you and your project team some solid answers.

If you're starting an AWS Data Warehouse project and/or have questions, please Email Us and we'll get back to you ASAP.

Part III-A: Aggregations and Filters – AWS Data Warehouse Solution

Data Sets – Data Model

  • Source Systems
  • Data Model : Definition of Core Facts – Dimensions
  • Hierarchies, Drill-Down, Drill-through
  • Aggregations –  Default
  • Default Filters – User Defined Filters
  • Extended Aggregations – UDF, UDAF [Hadoop, Hive]
  • Predictive Analytics – Data Mining
  • Business Intelligence Tools – AWS Compatibility
  • Visualizations
  • Reporting

Aggregations

Wikipedia defines aggregates as structures used in dimensional models of the data warehouse to produce dramatic positive effects on the time it takes to query large sets of data. The simple way to picture an aggregate is as a summary of data. We aggregate in speech when we paraphrase (sentiment analytics); we aggregate at work when we boil down our action items…

In SQL (RDBMS or NoSQL), we use functions like COUNT, AVG, SUM, MIN, etc., and GROUP BY a set of properties such as date or location to find sales by category by date, or customers by product. We typically aggregate FACTS [metrics] from a fact table and use the dimensions to build hierarchies that support drill-down and drill-through. [Sales by Quarter (drill to) Month (drill to) Day (drill to) Hour – Fact: Sales, Dim: Time.]
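As a rough sketch (the fact and dimension table names here are hypothetical, not from the examples below), drilling from quarter down to month is usually just one more column in the GROUP BY against the time dimension:

-- Hypothetical star schema: fact_sales(sale_amount, time_key), dim_time(time_key, sales_quarter, sales_month)
SELECT t.sales_quarter, SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_time t ON f.time_key = t.time_key
GROUP BY t.sales_quarter;

-- Drill down to month: same query, one extra grouping column from the dimension.
SELECT t.sales_quarter, t.sales_month, SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_time t ON f.time_key = t.time_key
GROUP BY t.sales_quarter, t.sales_month;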

Aggregates are important because they allow us to answer the questions most people ask before they ask them. In a large enterprise data warehouse, at the start of the week or the end of the month, sales and marketing always ask the same questions. It could be one manager or dozens asking, "How much? Where? When?…" If you have all the data pre-computed and available for everyone all the time, the system becomes extremely fast. That's aggregation.
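One common way to have those answers pre-computed is a summary (aggregate) table rebuilt by the ETL job; a minimal sketch, using the same hypothetical fact_sales table and assumed column names as above:

-- Hypothetical nightly-built summary table; Monday-morning questions hit this small table, not the raw facts.
CREATE TABLE daily_sales_summary AS
SELECT sale_date,
       region,
       product_category,
       SUM(sale_amount) AS total_sales,
       COUNT(*)         AS order_count
FROM fact_sales
GROUP BY sale_date, region, product_category;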

Example: Counts and Averages of IoT Metrics.

SELECT sensortype, COUNT(*)
FROM sensordata_load
GROUP BY sensortype;
--RESULT
sensortype,count(*)
currentActivityLevel,75
currentBodyTemp,51
currentHeartRate,78
currentLevel,674
currentPresure,653
currentRespiration,48
currentTemp,709

SELECT sensortype, AVG(metric)
FROM sensordata_load
GROUP BY sensortype;
--RESULT
sensortype,avg(metric)
currentActivityLevel,1.7704135637494671
currentBodyTemp,99.2531506178647
currentHeartRate,77.91549099988207
currentLevel,10.105113107425208
currentPresure,990.4420205285952
currentRespiration,16.913399179652078
currentTemp,43.30526224303287

In Hadoop and MongoDB (NoSQL), we do the same things using Hive and/or Spark as an abstraction over MapReduce (i.e., you may be running a SQL query in Hive, but it gets converted into a MapReduce job behind the scenes; if you know MR, you could bypass that step to speed things up. More on Spark and RDDs later).
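As a sketch of what that looks like on EMR/Hadoop (the S3 path and column types are assumptions), the same aggregation can run in Hive over raw files, with the engine planning the MapReduce stages for you:

-- Hypothetical Hive external table over the raw sensor files in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS sensordata_load (
    sensortype STRING,
    metric     DOUBLE,
    reading_ts TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/sensordata/';

-- Same GROUP BY as the queries above; Hive compiles it into map/shuffle/reduce stages.
SELECT sensortype, COUNT(*) AS readings, AVG(metric) AS avg_metric
FROM sensordata_load
GROUP BY sensortype;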

Filters

Filters are simple: they are implemented to limit the amount of data that is returned or searched. This has big implications in Big Data and RDBMS systems. People will often say, "Give me all the data everywhere… but then just show me my data." Yikes. In the background, every server in your cluster spins up when the data probably only resides in a handful of locations. If you slightly change the way you ask the question, you can really speed things up.

In SQL, the filter is the WHERE clause. Things have gotten a bit tougher in Big Data because now we have things like shard keys, partition keys, and search keys, and they all affect the WHERE. When building a data warehouse, it's the architect's and developers' responsibility to anticipate the types of questions folks will be asking and the types of filtering they will want by default and ad hoc (on the fly).

Quick WHERE Examples:

WHERE
    region = 'WEST COAST'
AND color IN ('RED','WHITE','BLUE')
AND size IN ('S','M','L','XL')
AND product_category IN ('TRUCKS','CARS','TRAILERS')
...

WHERE
    device_type IN ('TEMPERATURE','PRESSURE')
AND installation_loc IN ('COOLING PLANT', 'HOLDING TANK FARM', 'MAIN AGITATOR1')
AND sensor_dt_stamp BETWEEN (NOW - 18 months) AND (NOW)
...
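To show how those keys change the cost of a WHERE, here is a hedged sketch of a Hive table partitioned by date (the table and column names are assumptions); a filter on the partition column means only the matching partitions are scanned instead of the full history:

-- Hypothetical table partitioned by event_date; the WHERE below prunes to the last 18 months of partitions.
CREATE TABLE IF NOT EXISTS sensor_events (
    device_type      STRING,
    installation_loc STRING,
    metric           DOUBLE
)
PARTITIONED BY (event_date DATE);

SELECT device_type, AVG(metric) AS avg_metric
FROM sensor_events
WHERE event_date >= add_months(current_date, -18)
GROUP BY device_type;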

A best practice is to brainstorm with your stakeholders and, by default, limit data they would typically not search. Filter everything automatically; your users will have to choose to turn that filter off. Good examples are limiting data sets to the last 18 months by default, limiting data to a specific region (very important when building a DW for international users, as it may be a legal requirement), or limiting data by IoT device type or installation (see the examples above). Talk about having a huge impact on your resource requirements (by not having to search through data that does not matter to your answer). I call these semantic optimizations. Just a small change in code and process can save time and money. FYI, this is called "Default Filtering" and is often ignored out of fear of missing something. A sketch of one way to implement it follows.
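One hedged way to implement default filtering is a view that bakes the 18-month window in, so reports and BI tools hit the view by default and users must deliberately go to the base table to turn the filter off (the view name and timestamp column are assumptions; the date function shown is Redshift's DATEADD):

-- Hypothetical default-filter view; dashboards and reports point here, not at sensordata_load.
CREATE VIEW sensordata_recent AS
SELECT *
FROM sensordata_load
WHERE reading_ts >= DATEADD(month, -18, CURRENT_DATE);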

Please Contact Us or Subscribe to our Blog if you found this interesting or would like more information.
