
Real-Time Streaming Pipelines for Unstructured Data: An AWS End-to-End Data Engineering Project

Yusuf Ganiyu

Thu, 17 Apr 2025


In the digital era, where data is likened to oil, the ability to efficiently process and analyze unstructured data in real-time is becoming increasingly crucial for businesses to maintain a competitive edge. The advent of technologies such as Apache Spark and cloud services like AWS has revolutionized data engineering, enabling the handling of diverse data types — text, images, videos, CSVs, JSON, and PDFs — across myriad datasets. This article delves into the intricacies of building real-time streaming pipelines for unstructured data, offering a roadmap for data engineers and IT professionals to navigate this complex landscape.

The Challenge of Unstructured Data

Unstructured data, lacking a predefined data model, represents the majority of data available in the digital universe today. From social media posts and digital images to emails and PDF documents, unstructured data is rich in information but complex to process and analyze due to its non-standardized format. The challenge lies not only in processing this data efficiently but also in doing so in real-time to extract valuable insights that can inform business decisions, enhance customer experiences, and drive innovation.

[Image: structured data vs. unstructured data]

Use case: Job Posting Data

Real-time streaming pipelines have emerged as a powerful solution to the challenges posed by unstructured data. These pipelines enable continuous data processing, allowing businesses to analyze and act upon data as it is generated. Unlike batch processing, which processes data in chunks at scheduled times, real-time streaming provides instantaneous insights, a critical advantage in applications such as fraud detection, social media monitoring, and live customer support.

Job postings are inherently diverse and complex. They contain a mix of structured data (such as employment type, experience level, and location) and unstructured data (including job descriptions, requirements, and qualifications). The unstructured nature of much of this data makes it difficult to categorize and analyze using traditional data processing methods. Real-time streaming pipelines offer a solution, enabling the continuous analysis of job postings as they are published, providing immediate insights into trends, skills demand, and market needs.

In the competitive job market, both employers and job seekers benefit from immediate access to information. Real-time streaming pipelines can significantly enhance the job posting and application process in several ways:

  1. Instantaneous Job Posting Updates: Employers can update job postings in real time, and changes are immediately reflected across all platforms. This ensures that potential applicants have access to the most current information, including job descriptions, requirements, and application deadlines.
  2. Dynamic Applicant Matching: By analyzing job seeker profiles and behaviors in real time, companies can dynamically match candidates with new job postings as soon as they become available. This not only improves the candidate’s experience by presenting them with relevant opportunities instantly but also helps employers quickly identify and engage with potential hires.
  3. Real-Time Feedback and Analytics: Employers can receive immediate feedback on the number of views, applications, and interactions with a job posting. This data can be used to adjust recruitment strategies on the fly, such as modifying job descriptions or requirements to attract a broader pool of candidates.
  4. Enhanced Candidate Experience: Real-time communication channels can be integrated into the job application process, allowing for instant notifications, updates, and interactions between employers and candidates. This immediacy can improve the overall experience for job seekers, keeping them engaged and informed throughout the application process.

How feasible is this, you ask? Let’s look at a few examples:

Example 1: Identifying Emerging Skills in Tech Industry

Scenario: A tech company is looking to stay ahead of the curve by identifying emerging skills and technologies in the software development sector. Traditional methods of analyzing job postings often result in lagging indicators, as the data becomes outdated by the time it is processed.

Solution: By implementing a real-time streaming pipeline, the company can analyze job postings from various tech job boards and company websites as they are published. Using natural language processing (NLP) techniques within the pipeline, the system can extract key skills and technologies mentioned in the unstructured text of job descriptions and requirements.

Outcome: The company identifies a rapid increase in demand for skills related to artificial intelligence (AI) and machine learning (ML), as well as a growing interest in the Rust programming language. This insight allows them to quickly pivot, offering training sessions to their existing staff and adjusting their recruitment focus to attract talent with these emerging skills.
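As a rough illustration of the extraction step this example describes, here is a minimal sketch of keyword-based skill matching. It is not the pipeline’s actual code; the skill list and sample text are purely illustrative.

import re

# Illustrative watch-list of skills; a real pipeline would maintain a much
# larger list (or a trained NLP model) outside the job code.
SKILLS = ["machine learning", "artificial intelligence", "rust", "python"]

def extract_skills(description: str) -> list:
    """Return the watched skills mentioned in a job description."""
    found = []
    for skill in SKILLS:
        # Case-insensitive, word-boundary match so 'Rust' does not match 'trust'.
        if re.search(rf"\b{re.escape(skill)}\b", description, flags=re.IGNORECASE):
            found.append(skill)
    return found

print(extract_skills("We need engineers with Rust and machine learning experience."))
# ['machine learning', 'rust']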

Example 2: Real-Time Labor Market Analysis for Educational Institutions

Scenario: An educational institution aims to align its curriculum with the current job market demands to improve employability outcomes for graduates. Traditional curriculum review processes are slow and often rely on outdated job market data.

Solution: By leveraging a real-time streaming pipeline to analyze job postings across various sectors, the institution can gain immediate insights into the skills and qualifications most frequently demanded by employers. This analysis includes parsing complex job descriptions and extracting information on required qualifications, preferred skills, and industry certifications.

Outcome: The institution notices a significant demand for digital marketing skills across multiple industries, including certifications in Google Analytics and AdWords. In response, they quickly develop and introduce a new digital marketing module into their business and marketing programs, ensuring their graduates are well-equipped for the current job market.

Example 3: Adaptive Recruitment Strategies in Competitive Markets

Scenario: A recruitment agency specializes in placing candidates in highly competitive sectors, such as finance and technology. They struggle to keep up with the rapidly changing job market, often relying on intuition rather than data to advise clients on recruitment strategies.

Solution: Implementing a real-time streaming pipeline allows the agency to continuously monitor and analyze job postings from a wide range of sources, including niche job boards and financial news outlets. The system uses text analytics to identify trends in job titles, required experience levels, and competitive salary ranges.

Outcome: The agency identifies a growing trend for roles in financial technology (fintech) startups, with a particular emphasis on cybersecurity expertise and compliance knowledge. Armed with this information, they advise their clients to highlight these skills in their job postings and offer competitive salaries to attract top talent. Additionally, they guide candidates to focus on these growing niches, enhancing their placement success rate.

Example 4: Global Talent Acquisition for Multinational Corporations

Scenario: A multinational corporation with operations across several countries needs to understand global talent trends to standardize and optimize its hiring practices. The diversity of job markets and the volume of data make it challenging to gather actionable insights.

Solution: A real-time streaming pipeline, equipped with the capability to process and analyze job postings in multiple languages, enables the corporation to monitor global job trends. Advanced analytics and machine learning models are used to categorize data by region, industry, and job function, providing a comprehensive view of global talent demands.

Outcome: The corporation discovers a high demand for project management professionals in its Asian markets, contrasting with a focus on creative roles in marketing within North American regions. This insight allows them to tailor their recruitment strategies and job postings to match the regional demands, improving recruitment efficiency and effectiveness.

For folks interested in the video walkthrough, you can follow along here:

Project Setup

We set up our Spark cluster on Docker using a master-worker architecture, with every node based on the bitnami/spark:latest image from Docker Hub, in the form of:

version: '3'

x-spark-common: &spark-common
  image: bitnami/spark:latest
  volumes:
    - ./jobs:/opt/bitnami/spark/jobs
  command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
  depends_on:
    - spark-master
  environment:
    SPARK_MODE: Worker
    SPARK_WORKER_CORES: 2
    SPARK_WORKER_MEMORY: 1g
    SPARK_MASTER_URL: spark://spark-master:7077
  networks:
    - datamasterylab

services:
  spark-master:
    image: bitnami/spark:latest
    volumes:
      - ./jobs:/opt/bitnami/spark/jobs
    command: bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "9090:8080"
      - "7077:7077"
    networks:
      - datamasterylab
  spark-worker-1:
    <<: *spark-common
  spark-worker-2:
    <<: *spark-common
  spark-worker-3:
    <<: *spark-common
  spark-worker-4:
    <<: *spark-common

networks:
  datamasterylab:

Spinning this up with the command below, we get something like the following in the terminal as well as in our Docker dashboard.

docker compose up -d

[Image: docker compose output for the Spark master-worker architecture]

[Image: Spark master and worker containers in the Docker dashboard]

Now that our architecture is set up and ready to accept job submissions, it’s time to write the real-time streaming Spark jobs we will submit to the cluster.

Because unstructured data is inherently unpredictable, it is imperative to identify exactly what information you want to extract from the documents you’re working with.

So in our case, there’s a need to first review the documents and understand the keywords they contain, the likely order of the wording, and where each piece of information to be extracted naturally appears.

I have taken some time to review a couple of the input files before writing the extraction code. You will also notice that the structure is not fixed: under ANNUAL SALARY, for example, the upper and lower bounds vary and can change, and the same goes for the position of the class code and other fields.

[Images: sample unstructured input files (text bulletins and JSON) reviewed before writing the extraction code]

Regular Expressions: The Knight in Shining Armor!

Regular expressions (regex) are a powerful tool for searching and manipulating text. They enable the identification of patterns within text data, making them indispensable for parsing complex job postings. Job postings often come in varying formats and structures and can be challenging to parse; with regular expressions, however, extracting key information such as job titles, descriptions, qualifications, and skills becomes manageable. Parsing can feel a little complicated if you’re not familiar with regular expressions, but they are fairly easy to pick up if you’re willing to invest some time.

The Challenge of Parsing

Parsing job postings can be daunting due to their unstructured nature. Different organizations may use different formats and terminologies to describe job roles, responsibilities, and qualifications. Regular expressions come to the rescue by providing a flexible and efficient method to search for and match patterns within this unstructured data, enabling the extraction of relevant information.

Learning Curve

While the syntax of regular expressions may seem intimidating at first, the basics are straightforward to grasp. There are numerous resources and tutorials available that can help beginners get up to speed quickly. The investment in learning regex pays off by significantly simplifying the data parsing process.
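To make this concrete, here is a minimal sketch of the idea. The sample bulletin text and the field patterns are illustrative only, not the exact patterns used later in the pipeline.

import re

sample = """SYSTEMS ANALYST
Class Code: 1596
ANNUAL SALARY
$55,332 to $80,930 and $65,145 to $95,254
REQUIREMENTS/MINIMUM QUALIFICATIONS
1. A bachelor's degree from an accredited college ...
"""

# The first non-empty line of a bulletin is usually the position title.
position = re.search(r"^\s*(.+?)\s*\n", sample).group(1)

# 'Class Code:' is followed by a numeric code, though its position varies.
class_code = re.search(r"Class Code:\s*(\d+)", sample).group(1)

# Everything between the REQUIREMENTS heading and the next blank line (or end of text).
requirements = re.search(
    r"REQUIREMENTS/MINIMUM QUALIFICATIONS\s*\n(.+?)(?:\n\s*\n|$)",
    sample, flags=re.DOTALL,
).group(1).strip()

print(position, class_code, requirements, sep="\n")
# SYSTEMS ANALYST
# 1596
# 1. A bachelor's degree from an accredited college ...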

Setting up our Spark Streaming Client

Apache Spark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports real-time data processing. Setting up a Spark streaming client involves configuring Spark to receive real-time data streams, which, in this case, could be streams of job postings from various sources.

Configuration Steps:

  1. Install Apache Spark and any necessary dependencies on your system or cloud environment; in our case, we’ll leverage the Spark cluster running on the Docker engine.
  2. Initialize a Spark session for Structured Streaming, choosing a trigger interval (e.g., a few seconds) to define how often your streaming application processes new data.
  3. Define input sources, such as Kafka or Flume, that will push the job posting data into your Spark streaming application. In our case, we’ll focus on listening to specific folders where the data will be dropped and parsed in real time, as set up in the code that follows.
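One note before the session code: it pulls AWS credentials from a `configuration` object that is not shown in this article. A minimal sketch of one possible version, assuming the credentials are supplied as environment variables, could look like this:

# config.py -- hypothetical helper module imported by the streaming job as `configuration`
import os

configuration = {
    'AWS_ACCESS_KEY': os.environ.get('AWS_ACCESS_KEY'),
    'AWS_SECRET_KEY': os.environ.get('AWS_SECRET_KEY'),
}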
from pyspark.sql import SparkSession

# create a spark session configured to write to S3 via the s3a connector
spark = (SparkSession.builder.appName('AWS_Spark_Unstructured')
         .config('spark.jars.packages',
                 'org.apache.hadoop:hadoop-aws:3.3.1,'
                 'com.amazonaws:aws-java-sdk:1.11.469')
         .config('spark.hadoop.fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
         .config('spark.hadoop.fs.s3a.access.key', configuration.get('AWS_ACCESS_KEY'))
         .config('spark.hadoop.fs.s3a.secret.key', configuration.get('AWS_SECRET_KEY'))
         .config('spark.hadoop.fs.s3a.aws.credentials.provider',
                 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
         .getOrCreate())

# directories to listen to (mounted into the containers under /opt/bitnami/spark/jobs)
text_input_dir = 'file:///opt/bitnami/spark/jobs/input/input_text'
json_input_dir = 'file:///opt/bitnami/spark/jobs/input/input_json'
csv_input_dir = 'file:///opt/bitnami/spark/jobs/input/input_csv'
pdf_input_dir = 'file:///opt/bitnami/spark/jobs/input/input_pdf'
video_input_dir = 'file:///opt/bitnami/spark/jobs/input/input_video'
img_input_dir = 'file:///opt/bitnami/spark/jobs/input/input_img'

# read the text input stream from the folder, one record per file
job_bulletins_df = (spark.readStream
                    .format('text')
                    .option('wholetext', 'true')
                    .load(text_input_dir)
                    )
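Only the text stream is shown above; the other drop folders can be wired up the same way. For the JSON folder, for example, Structured Streaming requires an explicit schema up front, so a sketch (with purely illustrative field names) might look like this:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative schema -- the real fields depend on the JSON files you receive.
json_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('position', StringType(), True),
    StructField('salary_start', DoubleType(), True),
    StructField('salary_end', DoubleType(), True),
])

json_df = (spark.readStream
           .format('json')
           .schema(json_schema)        # streaming file sources need a schema
           .option('multiLine', True)  # each file holds one JSON document
           .load(json_input_dir)
           )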

Parsing and Extracting Information from the Files

Parsing

Utilizing Spark Streaming and regular expressions, the parsing process involves:

  1. Defining Regex Patterns: Identify patterns for the information you want to extract from job postings, such as qualifications (e.g., “Bachelor’s degree”) or skills (e.g., “Python programming”).
  2. Applying Patterns to Streams: Apply these regex extractors to each record in the data stream, for example by wrapping them as UDFs, to pull out the desired information (see the sketch after the extraction function below).
# extracting salary from the file
import re  # Make sure to import the re module for regex operations


def extract_salary(file_content):
    # Try-except block to handle any potential exceptions that may arise during execution
    try:
        # Define a regex pattern to match salary ranges within the file content.
        # This pattern looks for strings that start with a dollar sign, followed by an amount
        # (with commas for thousands), possibly followed by "to" and another amount,
        # indicating a salary range. It also accounts for a possible second range,
        # separated by "and".
        salary_pattern = r'\$(\d{1,3}(?:,\d{3})+).+?to.+\$(\d{1,3}(?:,\d{3})+)(?:\s+and\s+\$(\d{1,3}(?:,\d{3})+)\s+to\s+\$(\d{1,3}(?:,\d{3})+))?'

        # Use the re.search method to find a match for the salary pattern within the file content.
        salary_match = re.search(salary_pattern, file_content)

        # If a match is found, extract the salary range(s)
        if salary_match:
            # Convert the starting salary to a float, removing commas for thousands
            salary_start = float(salary_match.group(1).replace(',', ''))
            # If a fourth group is matched (indicating a second range), use its end value;
            # otherwise, use the second group's value as the end salary.
            salary_end = float(salary_match.group(4).replace(',', '')) if salary_match.group(4) \
                else float(salary_match.group(2).replace(',', ''))
        else:
            # If no match is found, set both start and end salaries to None
            salary_start, salary_end = None, None

        # Return the extracted salary range
        return salary_start, salary_end

    # Catch any exception that occurs during the try block execution
    except Exception as e:
        # Raise a ValueError with a message indicating the error in extracting the salary
        raise ValueError(f'Error extracting salary: {str(e)}')


"""
For example:
Input: Salary: $55,332 to $80,930 and $65,145 to $95,254
Output: (55332.0, 95254.0)

The same style is replicated for each of the fields to be extracted.
"""

Writing Results to AWS

After parsing and extracting the necessary information from job postings, the next step is to store this data for further analysis or immediate use. AWS provides a robust and scalable solution for storing and managing data.

  1. Choose an AWS Storage Service: Depending on your needs, you might opt for Amazon S3 for durable storage, Amazon DynamoDB for NoSQL database services, or Amazon RDS for relational database services. We’ll opt for Amazon S3, writing Parquet files into our bucket.
  2. Configure Spark to Write to AWS: Utilize the appropriate AWS SDK in your Spark application to connect and write the parsed data to the chosen AWS service. Ensure that your AWS credentials are securely managed and that your Spark application has the necessary permissions to write to the AWS service.
from pyspark.sql import DataFrame


def streamWriter(input: DataFrame, checkpointFolder, output):
    # Write the stream as Parquet files to S3, checkpointing progress so the
    # job can recover from failures, and trigger a micro-batch every 5 seconds.
    return (input.writeStream
            .format('parquet')
            .option('checkpointLocation', checkpointFolder)
            .option('path', output)
            .outputMode('append')
            .trigger(processingTime='5 seconds')
            .start()
            )


# union_dataframe is the DataFrame of parsed fields assembled earlier in the job
query = streamWriter(union_dataframe, 's3a://spark-unstructured-streaming/checkpoints/',
                     's3a://spark-unstructured-streaming/data/spark_unstructured')

query.awaitTermination()

After writing, the output looks like this:

[Image: Parquet output files in the S3 bucket]


Crawling, Previewing and Verifying the Results

The next step is to use an AWS Glue crawler to read the data in the Parquet files, create a table, and store it in a database.

[Image: AWS Glue crawler configuration for the unstructured data output]

After creating the crawler, run it on demand or on a schedule, depending on your use case, to get the processing underway.
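If you prefer to script this step rather than use the console, a rough boto3 equivalent might look like the sketch below; the crawler name, IAM role, database, and region are placeholders you would replace.

import boto3

glue = boto3.client('glue', region_name='us-east-1')  # region is illustrative

# Create a crawler pointed at the Parquet output written by the streaming job.
glue.create_crawler(
    Name='spark-unstructured-crawler',            # placeholder crawler name
    Role='AWSGlueServiceRole-SparkUnstructured',  # placeholder IAM role
    DatabaseName='spark_unstructured_db',         # placeholder Glue database
    Targets={'S3Targets': [{'Path': 's3://spark-unstructured-streaming/data/spark_unstructured/'}]},
)

# Run it on demand; a schedule can also be attached to the crawler instead.
glue.start_crawler(Name='spark-unstructured-crawler')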

After it runs, navigate to Tables in Glue and click “Table data” in the “View data” column.

[Image: table created by the Glue crawler for the unstructured data]

That takes you to Athena, where you can preview the output of the entire process.

[Image: querying the unstructured data output in Amazon Athena]
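The same preview can also be run programmatically. A hedged boto3 sketch is below; the database, table, output location, and region are placeholders, since the actual names depend on how the crawler was configured.

import boto3

athena = boto3.client('athena', region_name='us-east-1')  # region is illustrative

response = athena.start_query_execution(
    QueryString='SELECT * FROM spark_unstructured LIMIT 10',      # placeholder table name
    QueryExecutionContext={'Database': 'spark_unstructured_db'},  # placeholder database
    ResultConfiguration={'OutputLocation': 's3://spark-unstructured-streaming/athena-results/'},
)
print(response['QueryExecutionId'])  # use this id to poll for and fetch the results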

Deploying our Spark Job to Spark Cluster

At this point, everything is working as expected, so the next step is to deploy the job to the Spark master-worker cluster created in our Docker instance.

The simplest way to do this with minimal changes is to move the files and folders into the jobs directory, but you can make this fancier to suit your use case and coding style.

[Image: project files arranged in the jobs directory for deployment]

Finally, you can submit the jobs with the command below:

docker exec -it aws_spark_unstructured-spark-master-1 spark-submit \
--master spark://spark-master:7077 \
--packages org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk:1.11.469 \
jobs/main.py

[Image: submitting the Spark job to the cluster from the terminal]

[Image: Spark cluster dashboard showing the submitted application]

[Image: the streaming application running on the Spark cluster]

Once the process is running, you can repeat the crawling and previewing steps until you get the desired results.

Thanks for reading!

Resources

Full Course with Code and Dataset
