Friday, February 19, 2021

FAQ Series - When and How to Scaffold Data


Data structure has a huge effect on the results you get from your analysis. Like most of you, I trained using well-structured, complete data sets - where there was a value in every record (cell) - like a tightly fit puzzle.

  



With real-world data that is often not the case. Data can come from ERP systems, stand-alone third-party tools, or public sources, and it may contain voids, be sparse, or otherwise be badly structured, resulting in gaps in the data.




Scaffolding is one way to re-structure the data and fill the gaps.  


We are going to look at 3 frequently encountered use cases where data needs to be scaffolded to solve the problem. The cases are adapted from actual questions posted on the Forum - data and names have been changed.

  






1 - The 2 Date Problem

 

The '2 Date Problem' is a classic case - each record in the data has a start date and an end date and you need to look at all the periods in between. It comes up when you need to count active programs, total cash flow over a series of mortgages or annuities, or, as in this example, track head count over time.

The question came from an HR manager who had a number of open positions. Each had an "open date"; some had been filled and others were still open. The manager wanted to count, by month, the positions still open and, separately, those that had been filled.

The data looked like this:


 

Each requisition had start and end dates as independent values, but there was no calendar common to every record. The solution was to build a calendar using a scaffold. In this case, the manager was only interested in monthly totals, so a monthly date ladder would suffice:


The scaffold and the base data were joined so that every record in the base data was connected to each record in the scaffold - a "Cartesian join" - using a 1 = 1 join clause.
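One way to set this up in Tableau Desktop (a sketch, not necessarily the exact steps used here) is a join calculation on each side of the join - each side is simply the literal 1, so every base record matches every scaffold record:

    Left (base data):  Create Join Calculation...  ->  1
    Right (scaffold):  Create Join Calculation...  ->  1
    Join clause:       1 = 1   (inner join)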



Using the scaffold date resulted in a calendar that included each job requisition


All that was needed were 2 LOD expressions to count the filled and open jobs by month.

The filled LOD is:


{ FIXED [Job Ref ID1], [Scaffold Date] :
  COUNTD( IF NOT ISNULL([Job open date])
          AND DATETRUNC('month', [Job filled date]) = DATETRUNC('month', [Scaffold Date])
          THEN [Job Ref ID1] END ) }


and the open LOD is 

{ FIXED [Job Ref ID1], [Scaffold Date] :
  COUNTD( IF DATETRUNC('month', [Job open date]) <= DATETRUNC('month', [Scaffold Date])
          AND ( DATETRUNC('month', [Job filled date]) > DATETRUNC('month', [Scaffold Date])
                OR ISNULL([Job filled date]) )
          THEN [Job Ref ID1] END ) }


and the result looked like this



By using the date scaffold to create a calendar, the solution was reduced to just checking each month to see which positions were still open and which had been filled.


2 - When Drilling Down Breaks the Analysis

Data that appears to be solid at high levels of aggregation can be sparse or have voids when drilling down a dimension hierarchy or to lower date levels. The problem is common with data that comes from ERP systems (like Oracle or SAP). ERP systems are transaction driven and only create a record when a transaction takes place - e.g. if there are no sales for a particular SKU on a date, there is no record in the data set.

To illustrate the issue we will look at an inventory analysis where a production planner wanted to calculate a very common metric used as an indicator that inventory levels are either too high or too low.

For those not familiar with ERP systems and the forecasting-to-production process, a quick review. The ERP system records sales directly from the sales invoice. That data is precise - we know the customer, the date, the SKU (stock keeping unit), the unit quantity, and the dollar value. The data can then be aggregated along several different hierarchies - here along the product hierarchy - and is still very accurate, although somewhat less useful.




Forecasters and product managers will use the sales history to create a rolling forecast that is used by the production planning and purchasing departments to develop the production schedule. The process starts at the top of the pyramid and works down, each level being a lower level of aggregation that is more useful to the planner but less accurate than the preceding one.

Forecasts are usually adjusted monthly along a 3 or 4 month forecasting horizon. The production schedulers take the lowest level - units by SKU by week - and schedule production, purchase components, and plan inventory levels. While the forecasting and planning process produces numbers in the system that become less accurate at lower levels in the pyramid, actual production and inventory levels are precise: you know exactly how many units were produced for each SKU and what the inventory levels are.

With the uncertainties in the process, planners track a metric as an early warning that the forecast is too high or the inventory too low, so they can adjust the production schedule. The metric is Days on Hand (or Days Sales, or just Days). The calculation is simply:

Sum(Inventory level) / Average daily sales
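In Tableau that might be sketched like this (hedged - [Inventory] is an assumed field name, and the denominator uses the seven-day window average the planner had built, shown below):

// Days on Hand - sketch assuming an [Inventory] measure in the data
SUM([Inventory]) / WINDOW_AVG(SUM([Sales]), -6, 0)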

That is where the user question fits in. He was a production planning manager planning a single item, had 4 customers to consider, and was pulling sales, inventory, and production data from their ERP system. The data looked something like this, with a separate file for each customer:




He was getting different Days on Hand results that could not be reconciled for individual customers or across the group as a whole.

After filtering the results to show only the last date, they looked something like this across all 4 companies:


But some results look odd


So let's see what is going on:

First, the formula for the Seven Day Avg Sales is window_avg(sum([Sales]),-6,0),
and that is a clue to the first problem - window_avg is a table calculation that looks at the last 7 records, not the last 7 days.
This is what the actual data looked like when all the records were visible:



Many nulls - very sparse data. It is clear that for Customer D there was only 1 sale in the final 7 days, and that was for 1000 units, so the average did not take the missing days into account. The nulls in the data were "no record" nulls (see Null Types), and a scaffold is needed to force a record into the data for each combination of customer name and date.

Tableau Prep was used to force the records into the data - the first step joins the customer list to the date list (Cartesian).


The second joins the original data to the ladder 




After scaffolding, the last seven days of data look like this (note the zn() functions and the previous steps applied) - and the data now runs through the end of February.
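With the scaffold forcing a record into every customer/date combination, the seven-day average can treat the empty days as zero sales by wrapping the aggregate in zn() - a sketch of what that calculation might look like:

// Seven Day Avg Sales after scaffolding - zn() turns the null sales on scaffold-only rows into 0
WINDOW_AVG(ZN(SUM([Sales])), -6, 0)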


and the last day summary across the 4 companies is


The actual Days on Hand, once the missing data has been included, is very different from the original results. This is a case where the planner (the Subject Matter Expert) would not know what the data structure looks like - nor should he; that is the role of the dashboard designer and data scientist. It is also a by-product of the data source - the ERP system - that could affect your next analysis.


3 - Voids in the Data

Voids - single records or small groups of "no record" nulls in the data - will return a null (blank screen) if the unsuspecting user filters to that combination of dimensions.
The user had created a viz using small multiples - a great way to see the changes across all values of a dimension in a single view - here using Superstore data (note there are 17 images, one for each sub-category in the data set).



When the date is filtered to February 2016, 4 images disappeared (returned a null).



The user had 2 questions - why did it happen, and how can the problem be corrected?

When the user filtered to February, there were no records in 4 sub-categories - the voids returned nulls, causing the 4 images to "disappear".


 Here using a scaffold to force a record into each void will solve the problem.  

I used Tableau Prep to create the scaffold. First, Cartesian join the date and sub-category fields:



Then connect that to the Superstore dataset




But scaffolding exploded the data set to 28,000 records - at the same time, it did solve the user's problem.




Sometimes it is necessary to think beyond calculations and functions to create the viz you want to present - often the data needs to be re-structured. When the data contains voids or is missing records, scaffolding is an option to make the data set solid.


Hope this helps you better understand when and how to use different forms of scaffolding.  The workbook used here can be downloaded at Link to Examples

Enjoy
Jim

Also see the video presentation on VizConnect at Video

Thursday, December 10, 2020

My Top 5 Building Blocks on the Journey from Spreadsheets to Tableau

(Note: Originally presented on #Vizconnect - see the recording at https://www.youtube.com/watch?v=gZoVzJT3pZ4)

Presented at  TFF 2020 see  Video Here

Like many of you, I am not a data scientist. Actually, my background is in engineering and marketing.  I started Tableau with a lot of experience with Excel mixed with a bit of Access and I struggled for a while - that may sound familiar.  

Additionally, spreadsheet calculators and Tableau are fundamentally different. Spreadsheets work on a cell basis - values or formulas are in cells, and cells can be copied or referenced in formulas to return another single cell value. Tableau works with dimensions that categorize data and measures that are the actual values. Referring to a dimension returns all the associated values in the database - an entire column of values in spreadsheet terms.




Ok, so what? It meant learning Tableau from the ground up, and along the way there were 5 fundamental building blocks that, once learned, could be combined to create even the most complex solutions. Below I will present the skills needed and a couple of examples. Try them out with your own simple data sets - you can't break anything and you will learn a lot in the process.



I grew up with a calendar on the refrigerator door - 7 columns for days and 4 to 5 rows for weeks - and got comfortable seeing data organized in short, wide tables. Spreadsheets have a similar structure, so learning that Tableau needs a tall, narrow data table structure is outside our comfort zone. For example, spreadsheet column headers like dates need to become a single dimension (column) with each value a different date.




The way that is done is to pivot the data on the Data Source tab:




resulting in the tall narrow data structure needed 




Next you need to understand how to combine data sets - using Joins to add a column or Unions to extend the length of the data 




When you need to add another "column" of data to the table, use a Join. In Excel you would use a VLOOKUP.

To add Quantity data to the Sales data above, with data that looks like this:



The Quantity data needs to match Sales by Date and by Customer - so there are 2 join clauses:



Alternatively, for data that is updated regularly, like monthly Sales, a Union of the new month with the existing data will append the new data to the bottom of the existing table.




There is one more way to combine data that is at different levels of aggregation or comes from different data sources - Blending.

Load your first data source into Tableau then add a second source




Those are basic examples of 3 ways to connect data -  it can be more complex but with practice you will become comfortable selecting which to use  - in later examples you will see how they affect your calculations


 The order of operation is the sequence Tableau follows as it creates a viz.

For each worksheet, Tableau will use a subset of the total data set to create a table that "underlies" the viz. You can think of it like a spreadsheet - it's not really one - in the tall, narrow format. Once the dimensional table is built, values are added and manipulated to create the final chart, graph or visual.



While there are 10  steps in the order of operation, I find it useful to think of them in 3 groups



The first 2 steps are workbook level and filter the data from the data source - they can be used to improve the performance of the book.  

As dimensions are added to the Rows and Columns shelves and filters are applied (Steps 3-5), the dimensional table for the worksheet is formed.

The final 5 steps load and manipulate the values in the sheet.

A step by step explanation is too exhaustive for this post but is covered in Orderofoperation  on this blog.   

When I first started, I printed (yes on paper) a copy of the order of operation as a reference. Next to syntax mistakes, using steps out of order is the most frequent cause of errors.


While some of the formulas are similar to those in Excel, Tableau syntax is unique  and needs to be learned.  Fortunately, there is a Calculation Wizard to help

If you have written a lot of Excel formulas the wizard user interface (U/I) will look familiar.  Formulas are listed and grouped in categories.  A drop down can filter the formula list making specific functions easier to find.  On the right is an explanation of the formula you are working on and the arguments needed


Autofill and auto-suggest are also included, and at the bottom of the frame the syntax is checked as you enter the formula.


The syntax checker will tell you when the formula is valid - that does not mean the logic returns the answer you expected .  

The most common types of errors are due to mixing aggregated and non-aggregated fields, trying to aggregate something that is already aggregated, mismatched parentheses, or a missing operator. I still make them every day - after a while you know how to correct them. See aggregation to learn more about aggregation errors.
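For example, a hedged illustration of the aggregate / non-aggregate mix-up using Superstore fields:

// Tableau will flag this - SUM([Sales]) is aggregated but [Category] is not
IF SUM([Sales]) > 1000 THEN [Category] END

// One way to fix it - aggregate the other field as well
IF SUM([Sales]) > 1000 THEN MIN([Category]) END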

You didn't learn Excel formulas overnight and the same is true for Tableau - you will find there are a few you use repeatedly that will be like second nature and others that you need to look up. 

 

Calculations are the backbone of creating more advanced analyses and the visuals that go along with them.

There are 4 basic types of calculations in Tableau


Simple calculations, including "If ... Then" conditional statements, look similar to those in Excel except they are typically written at an aggregated level. Think back to the data structure: dimensions and measures refer to an entire "column" of Excel data.


So, to find the average selling price at the level of the dimensions in the viz:

ASP = SUM([Sales]) / SUM([Quantity])

will return this when Category is on the Rows shelf:


and this



Conditional statements are coded as If ... Then ... Elseif ... Else ... End
 
This statement would create Sales Groups out of Regions

IF [Region] = "East" OR [Region] = "West" THEN "Coastal"
ELSEIF [Region] = "Central" THEN "Middle"
ELSE "South" END

and return this


Conditional statements can be nested or used in any other type of calculation

The more you write them the more comfortable you will become with them

Date functions - real dates - are a special type of data in Tableau: they self-sequence, sort, and most of all can be used in date calculations. The functions can be discrete or continuous and are based on truncating the full date-time series at different levels.




Some of the date functions - year, month, quarter, day - seem familiar; others, like datetrunc, dateadd, and datediff, will take some getting used to. See date functions for more.

Like Excel, date functions come in 2 different types:

  • Those that return dates - DATETRUNC in Tableau or DATEVALUE in Excel
  • and those that return a part of the date - like MONTH in both systems
Date functions can be used with time as well. They are very flexible but do take some practice. I have several posts on converting strings to dates, calculating durations, and excluding weekends to use "Business Days" - see Date Examples.
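A few hedged examples, assuming Superstore-style [Order Date] and [Ship Date] fields:

DATETRUNC('month', [Order Date])              // returns a real date - the first day of the order month
DATEADD('month', 3, [Order Date])             // returns a real date shifted 3 months forward
DATEDIFF('day', [Order Date], [Ship Date])    // returns a number - days between order and ship
MONTH([Order Date])                           // returns a part of the date - the month number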

LOD Expressions - Extremely powerful, LODs allow you to create additional layers in your data at a level above the one entered when you loaded the data.


While LODs aggregate a measure, they are not aggregates in themselves and can be used in simple calculations, nested in other LODs, and included in Table Calculations.

{ FIXED [Segment], [Region] : SUM([Sales]) }

In words: take all the combinations of Segment (4) and Region (4), total the Sales for each, and save them for use later - you have established a layer in the data set that includes 16 values.
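One way to use that LOD inside a simple calculation (a sketch - it assumes Segment and Region are in the viz so each mark falls inside a single Segment/Region pair) is to compute each mark's share of its Segment/Region total:

// Share of the Segment/Region total - the FIXED result repeats on every row, so aggregate it with MIN before mixing it with SUM
SUM([Sales]) / MIN({ FIXED [Segment], [Region] : SUM([Sales]) })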

Table Calculations   With a background in Excel, table calculations will seem familiar.  Like Excel formulas, they operate on the rows and columns of the data table that underlies the worksheet but that is where the similarity ends.

In Tableau there are 2 concepts - Scope and Direction - that govern how the calculation is applied to the data in the table.


Table calculations come in 3 types:

    • Those that aggregate or rank values in the table, like:
      • Window_Avg, Running_Sum, Rank_Unique
    • Those used to navigate the table, like:
      • Lookup, Previous_Value, First, Last
    • Those that act as the interface to R or Python:
      • Script_Real, Script_Int
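A couple of hedged examples of the first two types, written against a [Sales] measure:

RUNNING_SUM(SUM([Sales]))           // cumulative total in the direction the calculation is computed
WINDOW_AVG(SUM([Sales]), -2, 0)     // 3-mark moving average: the current mark and the 2 before it
LOOKUP(SUM([Sales]), -1)            // navigates the table - returns the previous mark's sales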

You can use a Quick Table Calculation


Or you can write your own using the Calculation Wizard

The Direction for the calculation will default to Across but can be changed directly using "Compute Using" or by opening the Editor.





Filters, sets and parameters give you and the user ways to select, limit, and group data. Just like in Excel, filters limit the data in the underlying table; data that is filtered out is no longer available for calculations or presentation on the worksheet. Filters can be applied at the workbook level, to dimensions, or to measure values after they are added to the table.


Measure filters can be applied at the record level before aggregation or after selecting an aggregation level.


Sets classify a dimension's values into 2 groups - the "IN" group selected by the user, with all other values in the "OUT" group. All records are still available in the table, allowing comparisons between the In and the Out groups. Sets can be selected manually or by a formula, and multiple sets can be combined to identify items in both or in only a single group. Set actions can be used to change set members by selecting from a value picker or dropdown. Finally, sets are formed in Step 4 of the Order of Operation, after Context filters have been applied but before Dimension filters - giving the user several options when comparing values.
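In a calculation a set behaves like a boolean - TRUE for the In group - which makes the In/Out comparison easy to label (a sketch; [Top States Set] is a hypothetical set name):

// Label each record by set membership
IF [Top States Set] THEN "In" ELSE "Out" END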

Parameters allow the user to input a single static value by selecting from a list or by direct entry into the workbook. The value is constant until manually changed and is the same throughout the entire workbook.

But parameters don't do anything until they are included in a calculation (of any type) or a filter. It is the formula or the filter that affects the viz, not the parameter itself. The parameter value is the same everywhere in the workbook, but it can be used in different calculations or filters and need not be applied to every worksheet. Changing the parameter value will affect all the calculations or filters where the parameter is used. Parameter actions can be used to change the parameter value from visual elements on the worksheet.
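A minimal sketch of a parameter at work ([Sales Target] is a hypothetical numeric parameter):

// Compare each mark to the user-selected target value
IF SUM([Sales]) >= [Sales Target] THEN "At or above target" ELSE "Below target" END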



The parameter list can be updated dynamically when the workbook is opened - 




Those are the 5 building blocks on the road to understanding Tableau. Combined, they were used to create the viz below. It uses a parameter action to select the date, set actions to select states and segments, and LODs for YTD and YoY calculations.


It may seem formidable but with a little work on each you will develop a deep understanding of how Tableau works.

The workbooks that accompany these examples can be found at Tableau Public Link



Jim


 


 


 





