Knowing Your Data and Its Types

suraj j unni
7 min read · Jul 6, 2021

“Without data you’re just another person with an opinion.”

- W. Edwards Deming, statistician

Let’s discuss something that is huge but, once you comprehend it, can make outcomes predictable. Yes, you guessed it: this is about data. Most people have given some thought to data and the way it is used in applications. Here I will talk about data and its types.

What is data?

You will find varying definitions, but I would define data as the digital persistence of facts, knowledge, and information consolidated for reference or analysis. The focus of my definition is the word persistence, because digital facts remain even after the computers used to create them are powered down, and they are retrievable for future use.

Know Your Data (KYD)
Knowing your data is all about understanding the source technology that was used to create the data, along with the business requirements and rules used to store it. Do research ahead of time to understand what the business is all about and how the data is used. For example, if you are working with a sales team, learn what drives the team’s success. Do they have daily, monthly, or quarterly sales quotas? Do they produce month-end or quarter-end reports that go to senior management and have to be accurate because they have a financial impact on the company? Asking how the data will be consumed teaches you more about the source data and helps focus your analysis when you have to deliver results.
KYD is also about data lineage, which is understanding how the data was originally sourced, including the technologies used and the transformations that occurred before, during, and afterward. Keep the 3Vs of data (volume, velocity, and variety) in mind so you can answer common questions about the data effectively, such as where the data is sourced from or who is responsible for maintaining the data source.

Voice of the Customer (VOC)
VOC is nothing new. It has been taught at universities for years and is applied in sales, marketing, and many other business operations. VOC means understanding customer needs by listening to customers before, during, and after they use a company’s product or service. The concept remains relevant today and should be applied to every data project that you participate in.
During a tech talk at a local university, I was asked about the difference between KYD and VOC. I explained that both are important and both focus on communicating and learning more about the subject area or business. The key difference is prepared versus present. KYD is all about doing your homework ahead of time so that you are prepared before talking to experts. VOC is all about listening to the needs of your business or consumers regarding the data.

Understanding data types and their significance
As the 3Vs suggest, data comes in all shapes and sizes, so let’s break down some key data types and better understand why they are important. To begin, let’s classify data in general terms as unstructured, semi-structured, and structured.

Data Types

Unstructured data
A simple example of unstructured data is an email message body, which is classified as free text. Free text may have some obvious structure that a human can identify, such as white space that breaks up paragraphs, dates, and phone numbers, but a computer needs programming to recognize those elements as such. What makes free text challenging for data analysis is its inconsistent nature, especially when working with multiple examples.
When working with unstructured data, expect inconsistencies that come from the nature of free text, including misspellings, different date formats, and so on. Always have a peer review the workflow or code used to curate the data.
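
To make the point about programming concrete, here is a minimal Python sketch that pulls dates and phone numbers out of a free-text email body. The email text, the two regular expressions, and the function name `extract_elements` are assumptions made up for this illustration. Real messages will contain formats that these patterns miss.

```python
import re

# Hypothetical email body used only for illustration.
EMAIL_BODY = """Hi team,

Please call me at 212-555-0147 before 07/06/2021.
My backup number is (212) 555-0199, and the follow-up is due 2021-07-15.

Thanks,
Sam"""

# Assumed patterns: US-style phone numbers and two common date layouts.
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]\d{4}")
DATE_PATTERN = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")


def extract_elements(text):
    """Return the phone numbers and dates found in a block of free text."""
    return {
        "phones": PHONE_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }


found = extract_elements(EMAIL_BODY)
print(found)
```

Notice that the two dates come back in two different formats. That is exactly the kind of inconsistency that has to be cleaned up before analysis.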

Semi-structured data
Next, we have semi-structured data, which is similar to unstructured data. The key difference is the addition of tags, which are keywords or other classifiers used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files, as shown in the following code:
{
  "First_Name": "John",
  "Last_Name": "Doe",
  "Age": 42,
  "Home_Address": {
    "Address_1": "123 Main Street",
    "Address_2": [],
    "City": "New York",
    "State": "NY",
    "Zip_Code": "10021"
  }
}
This JSON-formatted code allows for free text elements such as a name, a street address, and an age, but now has tags that identify those fields and values, a concept called key-value pairs. Key-value pairs give the data enough structure for analysis such as filtering, while keeping the flexibility to change the elements as needed to support the unstructured/free text. The biggest advantage of semi-structured data is the flexibility to change the underlying schema of how the data is stored.
The schema is a foundational concept of traditional database systems that defines how the data must be persisted (that is, stored on disk).
The disadvantage of semi-structured data is that you may still find inconsistent data values, depending on how the data was captured. Ideally, the burden of consistency is moved to the User Interface (UI), which would have coded standards and business rules such as required fields to increase the quality. As a data analyst who practices KYD, you should still validate that during the project.
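
As a minimal sketch of how key-value pairs support filtering and validation, the Python snippet below parses the record shown above with the standard `json` module. The `Phone_Number` key, the list of required fields, and the helper `missing_required` are hypothetical and exist only for this example.

```python
import json

# The sample record from the article, stored as a JSON string.
RAW_RECORD = """
{
  "First_Name": "John",
  "Last_Name": "Doe",
  "Age": 42,
  "Home_Address": {
    "Address_1": "123 Main Street",
    "Address_2": [],
    "City": "New York",
    "State": "NY",
    "Zip_Code": "10021"
  }
}
"""

record = json.loads(RAW_RECORD)

# Tags (keys) let us reach a nested value directly.
city = record["Home_Address"]["City"]

# Flexible schema: a key may be absent, so read it with a default.
# "Phone_Number" is a hypothetical field that this record does not have.
phone = record.get("Phone_Number", "unknown")


def missing_required(rec, required_keys):
    """Return the required keys that are absent or empty in the record."""
    return [key for key in required_keys if not rec.get(key)]


# Assumed business rule: these three fields are required.
missing = missing_required(record, ["First_Name", "Last_Name", "Phone_Number"])
print(city, phone, missing)
```

A check like `missing_required` is one simple way to validate what the UI was supposed to enforce.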

Structured data
Finally, we have structured data, which is the most common type found in databases and in data created by applications (apps or software) and code. The biggest benefit of structured data is consistency and relatively high quality from record to record, especially when the records are stored in the same database table.
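
One way to see where that consistency comes from is a table with a declared schema. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database. The `customers` table, its columns, and the helper `try_insert` are assumptions for illustration only.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: the schema states which fields are required
# and which values are acceptable.
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL,
        last_name   TEXT NOT NULL,
        age         INTEGER CHECK (age >= 0)
    )
    """
)


def try_insert(connection, first_name, last_name, age):
    """Insert one row; return False if the schema rejects it."""
    try:
        connection.execute(
            "INSERT INTO customers (first_name, last_name, age) VALUES (?, ?, ?)",
            (first_name, last_name, age),
        )
        return True
    except sqlite3.IntegrityError:
        return False


ok_row = try_insert(conn, "John", "Doe", 42)      # satisfies every rule
no_last_name = try_insert(conn, "Jane", None, 30)  # violates NOT NULL
negative_age = try_insert(conn, "Sam", "Lee", -1)  # violates the CHECK rule

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(ok_row, no_last_name, negative_age, row_count)
```

Because the database rejects the two bad rows, every record that does make it into the table follows the same rules.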

Common data types

Data types are a well-known concept in programming languages and are found in many different technologies. The following table summarizes some common data types.
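
As a rough illustration of how common data types look in practice, here is a minimal Python sketch. The field names and values are made up for this example and are not taken from the table. Other languages and databases use different names for similar types.

```python
from datetime import date

# Hypothetical values, one per common data type.
sample_values = {
    "employee_count": 250,          # integer: whole numbers
    "stock_price": 101.75,          # float: numbers with a decimal part
    "country": "India",             # string: text
    "is_active": True,              # boolean: True or False
    "hire_date": date(2021, 7, 6),  # date: a calendar day
}

# Ask Python which type it assigned to each value.
type_names = {name: type(value).__name__ for name, value in sample_values.items()}

for name, type_name in type_names.items():
    print(f"{name}: {type_name}")
```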

Data classifications

In the preceding diagram, the boxes directly below data show the three ways to classify data: continuous, categorical, or discrete.
Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinite possibilities. The bottom boxes in the diagram are examples so you can easily find them for reference. Continuous data examples include a stock price, weight in pounds, and time.
Categorical (descriptive) data has values with a string data type. Categorical data is qualitative, so it describes something specific such as a person, place, or thing. Some examples include a country of origin, a month of the year, the different types of trees, and your family designation.

Discrete data can be treated as either continuous or categorical depending on how it’s used for analysis. An example is the number of employees in a company. The count must be an integer/whole number, because you can never have partial results such as half an employee.
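
The three classifications can be sketched as a simple rule of thumb in Python. This is a deliberately naive heuristic based only on the Python type of the values, and the function name `classify_values` is an assumption for this example. In a real project the right classification also depends on how the values are used.

```python
def classify_values(values):
    """Guess a classification from the Python types of the values.

    Strings -> categorical, whole numbers -> discrete,
    other numbers -> continuous. Anything else is 'unknown'.
    """
    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if all(isinstance(v, str) for v in values):
        return "categorical"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete"
    if all(is_number(v) for v in values):
        return "continuous"
    return "unknown"


countries = ["India", "Japan", "Brazil"]   # country of origin
employee_counts = [120, 45, 300]           # number of employees
stock_prices = [101.75, 99.5, 102.0]       # stock price

print(classify_values(countries))
print(classify_values(employee_counts))
print(classify_values(stock_prices))
```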

Data attributes
Now that we understand how to classify data, let’s break down the attribute types available to better understand how you can use them. The easiest way to break down the types is to start with how you plan to use the data values:
Nominal data is data where you can distinguish between different values but not necessarily order them. It is qualitative in nature, so think of nominal data as labels or names, such as stocks or bonds, on which math cannot be performed because they are string values. With nominal values, you cannot determine whether stocks or bonds are better or worse without additional information.
Ordinal data is ordered data where a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative and uses labels or names, but now the values have a natural or defined sequence. Similar to nominal data, ordinal data can be counted, but not every statistical method can be applied to it.
Interval data is like ordinal data, but the distance between data points is uniform. Temperature in degrees Celsius is a good example because the differences between 5 and 10, 10 and 15, and 20 and 25 are all the same. Note that not every arithmetic operation can be performed on interval data. For instance, 20 degrees is not twice as warm as 10 degrees because the zero point is arbitrary, so understanding the context of the data and how it should be used becomes important.

Ratio data has a true zero and allows for all arithmetic operations and summary statistics, including sum, average, median, mode, multiplication, and division. The integer and float data types discussed earlier are classified as ratio data attributes, which in turn are also numeric/quantitative.
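
The four attribute types above can be sketched in Python by looking at which operations make sense for each. All sample values, including the rating scale in `RATING_ORDER`, are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from statistics import median

# Nominal: labels can be counted, but they have no order.
assets = ["stocks", "bonds", "stocks", "stocks", "bonds"]
asset_counts = Counter(assets)
most_common_asset = asset_counts.most_common(1)[0][0]

# Ordinal: labels with a defined sequence (assumed rating scale).
RATING_ORDER = ["poor", "fair", "good", "excellent"]
ratings = ["good", "poor", "excellent", "fair", "good"]
sorted_ratings = sorted(ratings, key=RATING_ORDER.index)
middle_rating = sorted_ratings[len(sorted_ratings) // 2]

# Interval: the distance between neighboring points is meaningful.
readings = [5, 10, 15, 20, 25]
steps = [later - earlier for earlier, later in zip(readings, readings[1:])]

# Ratio: a true zero, so sums and ratios are meaningful too.
headcounts = [10, 20, 40]
total_headcount = sum(headcounts)
median_headcount = median(headcounts)
growth_factor = headcounts[-1] / headcounts[0]

print(most_common_asset, middle_rating, steps, total_headcount, growth_factor)
```

Sorting the ratings only works because an order was supplied. Sorting the asset labels would give alphabetical order, which says nothing about which asset is better.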

Time data attributes are a rich subject that you will come across regularly during your data analysis journey. Time data covers date, time, or any combination of the two, for example, the time as HH:MM AM/PM, such as 12:03 AM.
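
As a small sketch of working with time data, the snippet below parses the 12:03 AM example with Python’s standard `datetime` module. The combined date and time string is a made-up value for illustration, and the AM/PM parsing assumes an English locale.

```python
from datetime import datetime

# Parse a 12-hour clock time: %I = hour (01-12), %M = minute, %p = AM/PM.
parsed_time = datetime.strptime("12:03 AM", "%I:%M %p")

# 12:03 AM is three minutes after midnight on a 24-hour clock.
as_24_hour = parsed_time.strftime("%H:%M")

# A date and a time combined in one value (hypothetical timestamp).
parsed_stamp = datetime.strptime("2021-07-06 12:03 AM", "%Y-%m-%d %I:%M %p")
iso_stamp = parsed_stamp.isoformat()

print(as_24_hour)
print(iso_stamp)
```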

Conclusion

This is just a glimpse of data and its types. With this information, you can start collecting data and try to sort it into the various data types. In upcoming blogs, I will talk about the processes used for data transformation and their applications.

Here are some sites where you can get data:

Kaggle: https://www.kaggle.com/datasets
FiveThirtyEight: https://data.fivethirtyeight.com/
The World Bank: https://data.worldbank.org/
