The story of Data

Data originates from a source in a pristine, fresh state. Then it gets tagged – for example, with a name, a voltage, and so on. This tagged information is typically stored in a file or a database. Thus, you may have a file with sensor readings such as current, voltage and power (watts). For a moment, assume that this is a sensor attached to your refrigerator’s compressor. On the other hand, data may be created by a human being, say by adding customer information into a CRM system – name, telephone number, address and zip code. All this data (in fact, the right word is information, as we are dealing with processed data) sits snugly in a database somewhere, possibly not in the location where it was produced. It may be on-prem (on the user’s own infrastructure) – typically in a company’s private datacentre – or it may be on the cloud. It doesn’t matter, as long as the data is captured as is and stored in the database, without any loss.
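To make the idea concrete, here is a minimal sketch of capturing tagged sensor readings into a database. It uses Python’s built-in SQLite; the table and column names (and the compressor reading itself) are purely illustrative.

```python
import sqlite3

# A hypothetical "readings" table for the refrigerator-compressor example:
# each reading is tagged with a sensor name plus current, voltage and power.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor TEXT, current_a REAL, voltage_v REAL, power_w REAL)"
)

# Capture one reading exactly as produced, without any loss.
conn.execute(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    ("fridge_compressor", 1.5, 230.0, 345.0),
)

row = conn.execute("SELECT * FROM readings").fetchone()
print(row)
```

Whether this sits on-prem or in the cloud, the principle is the same: the tagged values land in storage exactly as captured.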

If you consider a company, there are many departments, each typically with its own databases and perhaps its own programs. Each of these is called a silo. With many individual silos of data (and possibly programs) within the company, we don’t have a unified view of the organization. This is the reason why companies buy ERP systems – one single system that spans multiple data sets (departments). This is OK for a small or medium business. The problem starts when the scale of your operations grows. The servers in your company may be handling small loads, but as the business grows, you require more data to do analysis (descriptive, diagnostic, predictive, prescriptive) and gain an edge over the competitors. This means the company not only consumes the data it produces itself but also looks to glean information from outside. Now data is no longer about servers; we are talking about Data as a Service (DaaS).

Typically, companies use a combination of data stored on their private premises and on the cloud. This is referred to as a hybrid cloud. All sensitive data (regulatory or otherwise) is stored in the private cloud, while the processes and data that can be outsourced live in a public cloud. Now, data comes in two flavors – static and streaming. Static data (data at rest) is what is lying in your data warehouse (say, the past 20 years of data), while streaming data is also referred to as data in motion. As a consumer of data, we may have to look at both types together. This data must be cleaned and wrangled (or munged) so that it becomes production ready. This is typically the job of data science personnel. They use ETL (Extract, Transform and Load) to achieve their objective, whether with automated tools or plain SQL. Tools like Alteryx and Trifacta help them with this task.
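As a toy illustration of the ETL idea, the sketch below extracts some raw customer records, transforms them (trimming whitespace, dropping incomplete rows, de-duplicating) and loads the result into a table. The records and cleaning rules are hypothetical – real pipelines would use a proper tool or warehouse SQL.

```python
import sqlite3

# Extract: raw records as they arrived, warts and all.
raw = [
    {"name": " Alice ", "zip": "10001"},
    {"name": "Bob", "zip": None},         # missing zip code – not production ready
    {"name": " Alice ", "zip": "10001"},  # duplicate entry
]

# Transform: trim whitespace, drop rows with missing fields, de-duplicate.
seen, clean = set(), []
for rec in raw:
    if rec["zip"] is None:
        continue
    key = (rec["name"].strip(), rec["zip"])
    if key not in seen:
        seen.add(key)
        clean.append(key)

# Load: insert the cleaned rows into the destination table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, zip TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", clean)
count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)
```

Three messy input records become one clean, de-duplicated row – which is exactly the wrangling step that makes data production ready.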

Sometimes data follows a schema-on-read model (as in Hadoop) instead of schema-on-write, and such data also has to be taken into account. The old concepts of MDM (Master Data Management) and ERD (Entity Relationship Diagram) are fading away as data types become more and more complex. Instead, the new mantra is to automate as much as possible. It is very difficult to clean data that has multiple attributes and dimensions, hence we have to bring AI (Artificial Intelligence) algorithms into the picture. Most of this data is

  • Deterministic
  • Probabilistic
  • Humanistic
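To see what schema-on-read means in practice, here is a small sketch: records are stored as raw JSON lines with no fixed, upfront schema, and structure is imposed only when the data is read. The field names are illustrative, not from any real system.

```python
import json

# Raw storage: just lines of JSON text – no schema enforced on write.
raw_lines = [
    '{"sensor": "fridge_compressor", "voltage": 230.0}',
    '{"customer": "Alice", "zip": "10001"}',  # a completely different shape
]

# "Read schema": each consumer pulls out only the fields it cares about,
# tolerating records where a field is simply absent.
parsed = [json.loads(line) for line in raw_lines]
voltages = [rec.get("voltage") for rec in parsed]  # None when missing
print(voltages)
```

Contrast this with the SQL tables earlier, where the schema is fixed at write time and a malformed row is rejected up front.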

Algorithms should get better at cleaning up data – from data consumption to data unification and, finally, data serving over the last mile (which is visualization). Every bunch of data tells a story, and we must create one for use by other people (in, say, other departments). Integrating data across silos and drawing fresh wisdom from the data is the order of the day. So, make it worth the while.

God Bless!


Techno Spiritual Entrepreneur with over 30 years of experience in the IT industry. Author of 5 books, trainer and consultant. Seeker of the truth - inclined towards spirituality and technology. Also, love to read and write inspirational stuff.