In this blog post, we will learn how to do exploratory data analysis with AWS machine learning services. Machine learning is of little use without careful data analysis, where we understand the underlying trends and patterns in the data, and data preparation, where we fix the issues we found in the data.
This blog post covers:
- What Is Data Analysis?
- What Is Data?
- Usage Of Data Analysis
- Types Of Visualizations
- Why Data Preparation?
- What Is Amazon QuickSight?
- Amazon SageMaker Data Wrangler
According to the “The State of Data Science 2020” study, exploratory data analysis (EDA), data management, feature selection, and feature engineering account for more than 66% of a data scientist’s time.
What Is Data Analysis?
The data analysis process applies many steps to the data, such as cleansing and transformation, in order to uncover the hidden information it holds and to see how that information can help our decision-making.
What Is Data?
In the real world, data exists in many different formats: ages, colors, temperatures, audio, text, and so on. We can group it into four main categories:
- Numerical data, which represents numbers.
- Categorical data, which represents different classifications, such as a color that can be red or blue.
- Unstructured data, which does not follow a predefined structure, such as audio or free text.
- Time data, such as dates and timestamps.
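As a small illustration of these four categories, here is a minimal Python sketch. The record, its field names, and the `KNOWN_COLORS` set are hypothetical and exist only for this example. Real datasets need more careful rules than this.

```python
from datetime import datetime

# Hypothetical record with one field per category described above.
record = {
    "age": 34,
    "favorite_color": "red",
    "review_text": "Arrived late but works fine.",
    "purchased_at": datetime(2021, 3, 14, 9, 30),
}

# Assumed closed set of labels for the categorical field.
KNOWN_COLORS = {"red", "blue"}


def data_category(value):
    """Return a rough data category for a single value."""
    if isinstance(value, bool):
        return "categorical"  # True/False is a two-class label, not a quantity
    if isinstance(value, (int, float)):
        return "numerical"
    if isinstance(value, datetime):
        return "time"
    if isinstance(value, str) and value in KNOWN_COLORS:
        return "categorical"
    return "unstructured"  # free text, audio bytes, and anything else


categories = {field: data_category(value) for field, value in record.items()}
print(categories)
```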
Usage Of Data Analysis
- First, data analysis makes it easier to comprehend complex datasets that could consist of millions of entries.
- Second, in the machine learning context, it helps us understand how algorithms are performing, and analyze and assess them using different visualizations.
- Finally, data analysis makes it easy for us to spot potential relationships in our dataset.
Types Of Visualizations
- Comparison Visualizations: They help us compare values across a certain dimension. Types of comparison visualizations are:
  - Bar Chart
  - Line Chart
- Relationship Visualizations: They help us detect potential cause-and-effect relationships in our dataset. Types of relationship visualizations are:
  - Scatter Plot
  - Heat Map
- Composition Visualizations: They show what the data is made of, that is, how individual parts contribute to a whole. Types of composition visualizations are:
  - Pie Chart
- Distribution Visualizations: They help us understand the underlying data distribution. Types of distribution visualizations are:
  - Histogram
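To make the idea of a distribution visualization concrete, here is a minimal sketch of what a histogram computes: it groups values into fixed-width bins and counts how many fall into each. The `ages` list and the `histogram` helper are made up for this example, and the bars are printed as text instead of being drawn with a plotting library.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Each key of the returned dict is the lower edge of a bin.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample of customer ages.
ages = [22, 25, 27, 31, 33, 34, 35, 38, 41, 45, 47, 52, 58, 63]
bin_width = 10
age_bins = histogram(ages, bin_width)

# Print one text bar per bin; a longer bar means more values in that range.
for start, count in age_bins.items():
    end = start + bin_width - 1
    print(f"{start:>2}-{end:<2} | {'#' * count}")
```

The tallest bar shows where most of the values sit, which is the information a histogram is meant to convey.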
Why Data Preparation?
Before we hand our data over to the machine learning team, it will most likely have many issues that prevent using it directly. Some of these issues are:
- Imbalanced dataset: We might not have representative samples from all real cases of our problem domain. This is particularly important for classification problems.
- Different scales: Our data might use different scales, which means we have to bring everything to the same scale so that we compare apples to apples.
- Inconsistent formats: Our data might arrive in inconsistent formats, for example because of corruption in some of the sensors we read the data from.
- Difficult representation: Our data might not be straightforward numerical data that machine learning models can consume directly. It can be audio files or even categorical data that requires special processing.
- Missing values & outliers: We might have missing data due to optional fields or even system failures. Even worse, our data might contain outliers that are not representative of the real problem domain.
- High dimensionality: Our data might be highly dimensional, that is, it has too many features, which makes it difficult to visualize the data and train models on it.
- Highly correlated features: Our data might also contain so-called highly correlated features, which add little or no new information to our machine learning model or, even worse, can make our regression tasks perform worse.
- Malformed distribution: Our data distribution might be malformed and not what the machine learning algorithms expect.
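A few of these issues can be sketched in plain Python. The example below is a minimal illustration, assuming a made-up `incomes` column and `labels` column. It checks class balance, fills missing values with the median, drops outliers with the common 1.5 × IQR rule, and rescales the result to the 0 to 1 range. These are standard textbook techniques chosen for the example, not the only way to handle each issue.

```python
import statistics
from collections import Counter

# Hypothetical toy columns; None marks a missing value.
incomes = [42_000, 48_000, None, 51_000, 45_000, 500_000, 47_000, None]
labels = ["ok", "ok", "ok", "fraud", "ok", "ok", "ok", "ok"]

# 1. Imbalanced dataset: look at the share of each class.
class_share = {label: n / len(labels) for label, n in Counter(labels).items()}

# 2. Missing values: impute with the median, which is less sensitive
#    to extreme values than the mean.
observed = [x for x in incomes if x is not None]
median_income = statistics.median(observed)
imputed = [median_income if x is None else x for x in incomes]

# 3. Outliers: keep only values within 1.5 * IQR of the quartiles.
#    (In a real table you would drop the whole row, not just the value.)
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [x for x in imputed if lower <= x <= upper]

# 4. Different scales: min-max scaling maps the column onto [0, 1].
lo, hi = min(cleaned), max(cleaned)
scaled = [(x - lo) / (hi - lo) for x in cleaned]

print(class_share)
print(median_income, cleaned)
print([round(x, 2) for x in scaled])
```

Note that the order of the steps matters: here the single extreme income is removed before scaling, because otherwise it would squeeze all the other values close to zero.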
What Is Amazon QuickSight?
It’s a data visualization tool aimed at nontechnical business users. You can think of it as the equivalent of Microsoft Power BI in the AWS space. It has the following features:
- It has affordable pricing.
- It is scalable across large numbers of users.
- The datasets are processed using SPICE, a super-fast, parallel, in-memory calculation engine that makes our data visualizations highly performant.
Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler plays an important role in exploratory data analysis with AWS ML. It aims to be the quickest and simplest way to prepare data for machine learning. It gives us a visual interface to access data, perform feature engineering and EDA, and seamlessly operationalize our data flow by exporting it into an Amazon SageMaker Data Wrangler job, Amazon SageMaker pipeline, Python file, or SageMaker feature group.
SageMaker Data Wrangler provides a selection of 300+ pre-configured data transformations, such as one-hot encoding, column type conversion, imputing missing data with the mean or median, rescaling columns, and date/time embeddings, so you can mold your data into formats that models can use effectively without writing a single line of code. For example, you can convert a text field column into a numerical column with a single click, or author custom transformations in PySpark, SQL, and Pandas.
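To show what two of these transformations do to the data, here is a minimal plain-Python sketch of one-hot encoding and of extracting simple date/time features. This is not Data Wrangler's API or its implementation. The helper names `one_hot` and `datetime_features` and the sample values are assumptions made for illustration, and Data Wrangler applies the equivalent steps through its visual interface.

```python
from datetime import datetime


def one_hot(values):
    """Encode each categorical value as a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, encoded


def datetime_features(ts):
    """Break a timestamp into numeric parts that a model can consume."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
        "hour": ts.hour,
    }


# Hypothetical categorical column.
colors = ["red", "blue", "red", "green"]
categories, encoded = one_hot(colors)
print(categories)
print(encoded)

# Hypothetical timestamp column value.
features = datetime_features(datetime(2021, 3, 14, 9, 30))
print(features)
```

In both cases a column that a model cannot use directly (text labels, timestamps) becomes a set of numerical columns.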