Ultimate Guide to Exploratory Data Analysis

Definition

According to Wikipedia, Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

EDA helps us identify potential issues with our dataset (e.g. missing data and outliers), understand the nature and type of each variable, understand the relationships between variables, and effectively communicate our findings. This, in turn, helps in advising the company during decision-making, since the advice is data-driven.

Exploratory Data Analysis Process

The EDA process is similar across data science programming languages such as R and Python. It involves three major steps:

  • Data Input/Reading

    This involves reading your data into a programming language object so that the data is stored in memory. Each data source has its own way of being read. For example, in Python and R:

      #Python: read a CSV file into a pandas DataFrame
      import pandas as pd
      df = pd.read_csv('filepath/data.csv')

      #R: read a CSV file into a data frame
      df <- read.csv('filepath/data.csv')


    The method to use is dictated by the data size and format in which it's stored.
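
    For instance, pandas also provides dedicated readers for other common formats (the file names below are placeholders):

      #Excel files have their own pandas reader
      df_xlsx = pd.read_excel('filepath/data.xlsx')
      #so do JSON files
      df_json = pd.read_json('filepath/data.json')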

  • Data Cleaning and Analysis

This step involves getting a deeper understanding of the data you imported. It includes:

  1. Identifying missing data points

    During data collection, some respondents skip questions or pick an option without giving the required input. This leads to missing data that would otherwise have given better insight. EDA helps identify which values are missing and how much data is missing; such values are most commonly represented as null or NaN.
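
    As a minimal sketch, assuming a pandas DataFrame named df, the amount of missing data per column can be checked with:

     #count missing values in each column
     df.isnull().sum()
     #proportion of missing values in each column
     df.isnull().mean()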

  2. Identifying outliers

    Outliers are data points that are significantly different from the other data points in the dataset. We often identify them through quantiles or percentiles; a common rule of thumb is to flag values that fall outside a chosen quantile range, for example outside the 0.1 and 0.9 quantiles. A common treatment is to drop the outliers, although this should be weighed against the information they carry. Example in Python:

     #compute the lower and upper quantile cut-offs
     low, high = df['column'].quantile([0.1, 0.9])
     #boolean mask of the rows that fall within the cut-offs
     within = df['column'].between(low, high)
     #keep only the non-outlier rows
     df_clean = df[within]

  3. Distribution of the data

    We query the central tendency (mean, median), spread (standard deviation, variance, etc.), and shape of the data to get insights into its distribution. This helps us decide which statistical test or analysis to use, and also helps in checking for skewness (unbalanced data).
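
    As a rough sketch using the same assumed DataFrame df, the summary statistics and the skewness of a column can be inspected with:

     #summary statistics (count, mean, std, quartiles) for numeric columns
     df.describe()
     #skewness of a single numeric column (0 indicates a symmetric distribution)
     df['column'].skew()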

  4. Statistical Analysis

    The choice of statistical analysis method depends on the data distribution and the data types. It involves computing correlations between variables, which aids in understanding issues such as multicollinearity.

    We also perform statistical tests on the data to help us answer the hypotheses or objectives we formulated.

    We can also perform regression analysis on the data set to get more insights into the relationship between our response variable and the independent variables.
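
    As an illustrative sketch (the column and group names here are placeholders), a correlation matrix and a simple two-sample t-test in Python could look like:

     #correlation matrix between the numeric variables
     corr_matrix = df.corr(numeric_only=True)

     #two-sample t-test comparing a numeric column across two groups
     from scipy import stats
     group_a = df.loc[df['group'] == 'A', 'column']
     group_b = df.loc[df['group'] == 'B', 'column']
     t_stat, p_value = stats.ttest_ind(group_a, group_b)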

Data Visualization

This is considered the last step of Exploratory Data Analysis.

We visualize our data to identify patterns and trends (e.g. in time series) that would be difficult to spot in raw data.

It also involves communicating our results in simple language that policymakers can understand. We compile our results in pictorial form to highlight the major insights and present the data-driven recommendations.

The most common data visualization packages are ggplot2 (R), matplotlib (Python), Seaborn (Python), and plotly (R and Python).
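
As a minimal sketch (column names are placeholders), a quick histogram and scatter plot with matplotlib and Seaborn might look like:

    import matplotlib.pyplot as plt
    import seaborn as sns

    #histogram to inspect the distribution of a numeric column
    sns.histplot(data=df, x='column')
    plt.show()

    #scatter plot to inspect the relationship between two variables
    sns.scatterplot(data=df, x='column_x', y='column_y')
    plt.show()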

Importance

The significance of EDA can be broadly classified as:

  1. Data Quality and Better Understanding

    It helps you better understand the data you are working with and identify trends, patterns, and outliers (anomalies). This aids in planning how to analyze and interpret the data.

    Identifying missing values helps in ensuring we use reliable data.

  2. Communication

    The use of visualizations and summaries aids in presenting our results, making them understandable to a wider audience.

  3. Decision Making

    EDA can aid in making decisions backed up by data. This gives policymakers a chance to make informed decisions that tend to be more effective and achievable.

Conclusion

EDA is a critical step in data processing for getting insights about data sets. These insights aid in making data-driven decisions, which tend to be more effective.

It's essential for a data analyst to master EDA in order to help policymakers understand past and current events and adopt best practices for future ones.

I hope this ultimate guide serves as a valuable resource for anyone looking to improve their EDA skills.

You can check out a sample EDA procedure in this GitHub repository, and never feel shy about asking for guidance.