DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Enterprise AI Trend Report: Gain insights on ethical AI, MLOps, generative AI, large language models, and much more.

2024 Cloud survey: Share your insights on microservices, containers, K8s, CI/CD, and DevOps (+ enter a $750 raffle!) for our Trend Reports.

PostgreSQL: Learn about the open-source RDBMS' advanced capabilities, core components, common commands and functions, and general DBA tasks.

AI Automation Essentials. Check out the latest Refcard on all things AI automation, including model training, data security, and more.

Related

  • BigQuery DataFrames in Python
  • TAO: A Comprehensive Look at Facebook's Distributed Data Store
  • Top 5 Incidents and Outages of 2021
  • Playing With Pandas DataFrames (With Missing Values Table Example)

Trending

  • PostgresML: Streamlining AI Model Deployment With PostgreSQL Integration
  • OWASP Top 10 Explained: SQL Injection
  • Python for Beginners: An Introductory Guide to Getting Started
  • Running LLMs Locally: A Step-by-Step Guide
  1. DZone
  2. Data Engineering
  3. Data
  4. Performing Advanced Facebook Event Data Analysis With a Vector Database

Performing Advanced Facebook Event Data Analysis With a Vector Database

In this post, discover a comprehensive application designed to scrape event data from Facebook and analyze the scraped data using MyScale.

By 
Usama Jamil user avatar
Usama Jamil
·
Mar. 08, 24 · Tutorial
Like (2)
Save
Tweet
Share
2.8K Views

Join the DZone community and get the full member experience.

Join For Free

In today's digital age, professionals across all industries must stay updated with upcoming events, conferences, and workshops. However, efficiently finding events that align with one's interests amidst the vast ocean of online information presents a significant challenge.

This blog introduces an innovative solution to this challenge: a comprehensive application designed to scrape event data from Facebook and analyze the scraped data using MyScale. While MyScale is commonly associated with the RAG tech stack or used as a vector database, its capabilities extend beyond these realms. We will utilize it for data analysis, leveraging its vector search functionality to analyze events that are semantically similar, thus providing better results and insights.

You may notice that Grok AI utilized the Qdrant vector database as the search engine to retrieve real-time information from X (formerly known as Twitter) data. You can also assess the power of vector databases in this way with MyScale by integrating MyScale with other platforms like Apify to enhance daily life tasks through the development of simple personalized applications.

So in this blog, let’s develop an application that takes only the name of a city as input and scrapes all related events from Facebook. Subsequently, we will conduct data analysis and semantic search using the advanced SQL vector capabilities of MyScale.

Tools and Technologies

We’ll use several tools, including Apify, MyScale, and OpenAI, to develop this useful application.

  • Apify: A popular web scraping and automation platform that significantly streamlines the process of data collection. It provides the capability to scrape data and subsequently feed it to LLMs. This allows us to train LLMs on real-time data and develop applications.
  • MyScale: MyScale is a SQL vector database that we use to store and process both structured and unstructured data in an optimized way.
  • OpenAI: We will use the model text-embedding-3-small from OpenAI to get the embeddings of the text and then save those embeddings in MyScale for data analysis and semantic search.

How To Set Up MyScale and Apify

To start setting up MyScale and Apify, you'll need to create a new directory and a Python file. You can do this by opening your terminal or command line and entering the following commands:

Shell
 
mkdir MyScale
cd MyScale
touch main.ipynb


Let's install the packages. Copy the command below, and paste it into your terminal. These packages will provide the tools and libraries we need to develop our application.

Shell
 
pip install openai apify-client clickhouse-connect pandas numpy


This should install all the dependencies in your system. To confirm that everything is installed properly, you can enter the following command in your terminal.

Shell
 
pip freeze | egrep '(openai|apify-client|clickhouse-connect|pandas|numpy)'


This should include all the installed dependencies with their versions. If you spot any missing dependencies, you may need to re-run the installation command for that specific package. Now, we're ready to write our code after the installations.

Note: We will be working in a Python notebook. Consider every code block a notebook cell.

How to Scrape Data With Apify

Now, we will use the Apify API to scrape event data of New York City using Facebook Events scraper. 

Python
 
     import pandas as pd
from apify_client import ApifyClient
# Initialize the ApifyClient with your API token
client = ApifyClient("Enter_your_apify_key_here")

# Prepare the Actor input
run_input = {
    "searchQueries": ["Sport New York"],
    "startUrls": [],
    "maxEvents": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("UZBnerCFBo5FgGouO").call(run_input=run_input)

df_columns = ['Name', 'Datetime', 'Description', 'Users_Going', 'Users_Interested', 'Users_Responded', 'City', 'Organized_By', 'Street_Address']
dataframe1 = pd.DataFrame(columns=df_columns)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Use a dictionary comprehension to replace None values with an empty string
    row = {
        'Name': item.get('name', ''),
        'Datetime': item.get('dateTimeSentence', ''),
        'Description': item.get('description', ''),
        'Users_Going': item.get('usersGoing', ''),
        'Users_Interested': item.get('usersInterested', ''),
        'Users_Responded': item.get('usersResponded', ''),
        'City': item.get('location', {}).get('city', '') if item.get('location') else '',
        'Organized_By': item.get('organizedBy', ''),
        'Street_Address': item.get('location', {}).get('streetAddress', '') if item.get('location') else ''
    }
    # Ensure all None values are replaced by an empty string
    row = {k: (v if v is not None else '') for k, v in row.items()}
    dataframe1 = dataframe1._append(row, ignore_index=True)

# Cleaning the data
dataframe1['Description'] = dataframe1['Description'].replace('\\n', '', regex=True)


This script gives us the upcoming event details in the form of a pandas DataFrame.

Note: Don’t forget to add your Apify API key in the script above. You can find your API token on the Integrations page in the Apify Console.

Data Pre-Processing

When we gather raw data, it comes in various formats. In this script, we will bring the event dates into a single format so that our data filtering can be more efficiently done.

Python
 
     # Import necessary libraries for data manipulation and date parsing
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil import parser

# Function to parse date strings that may represent a range or a single date
def parse_dates(date_str):
    # Check if the date string contains a dash, indicating a range
    if '-' in date_str:
        parts = date_str.split('-')
        # If the string splits into two parts, it's a valid range
        if len(parts) == 2:
            try:
                # Parse start and end dates, formatting them to a readable format
                start_date = parser.parse(parts[0], fuzzy=True).strftime('%a, %b %d')
                end_date = parser.parse(parts[1], fuzzy=True).strftime('%a, %b %d')
                return start_date, end_date
            except ValueError:
                # In case of parsing error, do nothing (will handle below)
                pass  
    # If not a range or if parsing the range failed, try parsing as a single date
    try:
        parsed_date = parser.parse(date_str, fuzzy=True)
        # Format the single date for start_date and format differently for end_date
        start_date = parsed_date.strftime('%a, %b %d AT %I:%M %p EDT')
        end_date = parsed_date.strftime('%a, %b %d')  # Omitting time for end_date
        return start_date, end_date
    except ValueError:
        # Return NaN for both dates if parsing fails
        return np.nan, np.nan  

# Function to extract detailed date, time, and day from a date string
def extract_date_time_day(date_str):
    try:
        # Parse the date string, allowing for some flexibility in input format
        parsed_date = parser.parse(date_str, fuzzy=True)
        # Extract and format the date, time, and day parts
        date = parsed_date.strftime('%Y-%m-%d')
        day = parsed_date.strftime('%a')
        # Determine if the original string included a time component
        time_component = parsed_date.strftime('%I:%M %p') not in date_str
        time = parsed_date.strftime('%H:%M:%S') if not time_component else np.nan
    except ValueError:
        # If parsing fails, set date, time, and day to NaN
        date, time, day = np.nan, np.nan, np.nan
    
    return date, time, day

# Apply parse_dates function across the dataframe, creating new columns for start and end dates
dataframe1[['Start_Date', 'End_Date']] = dataframe1.apply(lambda row: pd.Series(parse_dates(row['Datetime'])), axis=1)

# Drop rows where Start_Date is NaN, indicating parsing was unsuccessful
dataframe = dataframe1.dropna(subset=['Start_Date'])

# Apply extract_date_time_day to split the start and end dates into separate date, time, and day columns
dataframe['Start_Date'], dataframe['Start_Time'], dataframe['Start_Day'] = zip(*dataframe['Start_Date'].apply(extract_date_time_day))
dataframe['End_Date'], _, dataframe['End_Day'] = zip(*dataframe['End_Date'].apply(extract_date_time_day))

# Remove the original 'Datetime' column as it's no longer needed
dataframe=dataframe.drop(['Datetime'], axis=1)

# Convert 'Start_Date' and 'End_Date' to datetime format, extracting just the date part
dataframe['Start_Date'] = pd.to_datetime(dataframe['Start_Date']).dt.date
dataframe['End_Date'] = pd.to_datetime(dataframe['End_Date']).dt.date

# Convert 'Start_Time' to datetime format, retaining the time information
dataframe['Start_Time'] = pd.to_datetime(dataframe['Start_Time'])


This code snippet uses pandas with datetime and dateutil packages of Python to format the data.

Generating Embeddings

To deeply understand and search events, we will generate embeddings from their descriptions using the text-embedding-3-small. These embeddings capture the semantic essence of each event, helping the application to return better results.

Python
 
     # Import OpenAI library for API access.
from openai import OpenAI 
# Initialize OpenAI client with an API key.
openai_client = OpenAI(api_key="your_openai_api_key_here")
# Function to get text embeddings 
def get_embedding(text, model="text-embedding-3-small"):
    return openai_client.embeddings.create(input=text, model=model).data
embeddings = get_embedding(dataframe["Description"].tolist())
# Extract embedding vectors from the embeddings object
vectors = [embedding.embedding for embedding in embeddings]
array = np.array(vectors)
embeddings_series = pd.Series(list(array))
# Add embeddings as a new column in the DataFrame.
dataframe['Description_Embeddings'] = embeddings_series


Now, we will insert the new DataFrame with embeddings into MyScale.

Connecting With MyScale

As discussed at the start, we will use the MyScale as a vector database for storing and managing data. Here, we will connect to MyScale in preparation for data storage.

Python
 
     import clickhouse_connect
client = clickhouse_connect.get_client(
    host='host_name_here',
    port=443,
    username='username_here',
    password='passwd_here'
)


This connection setup ensures our application can communicate with MyScale, and use SQL’s power for data manipulation and analysis.

Note: See Connection Details for more information on how to connect to the MyScale cluster.

Create Tables and Indexes Using MyScale

We now create a table according to our DataFrame. All the data will be stored in this table, including the embeddings.

Python
 
     client.command("""
    CREATE TABLE default.Events (
    Name String,
    Description String,
    Users_Going Int64,
    Users_Interested Int64,
    Users_Responded Int64,
    City String,
    Organized_By String,
    Street_Address String,
    Start_Date Date32,
    End_Date Nullable(Date32),
    Start_Time Nullable(DateTime64),
    Start_Day String,
    End_Day String,
    Description_Embeddings Array(Float32),
    CONSTRAINT check_data_length CHECK length(Description_Embeddings) = 1536
    ) ENGINE = MergeTree()
    ORDER BY (Name);
    """)


The above SQL statements create a table named Events on the cluster. The CONSTRAINT makes sure that all the vectors embedding are of the same length 1536.

Storing the Data and Creating an Index in MyScale

In this step, we insert the processed data into MyScale. This involves batch-inserting the data to ensure efficient storage and retrieval.

Python
 
     batch_size = 10  # Adjust based on your needs

num_batches = len(dataframe) // batch_size

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = start_idx + batch_size
    batch_data = dataframe[start_idx:end_idx]
    # print(batch_data["Description_Embeddings"])
    client.insert("default.Events", batch_data.to_records(index=False).tolist(), column_names=batch_data.columns.tolist())
    print(f"Batch {i+1}/{num_batches} inserted.")

client.command("""
ALTER TABLE default.Events
    ADD VECTOR INDEX vector_index Description_Embeddings
    TYPE MSTG
""")


Using pandas, the above code efficiently transfers our prepared dataset into the MyScale database.

Data Analysis Using MyScale

Finally, we use MyScale’s analytical capabilities to perform analysis and enable semantic search. By executing SQL queries, we can analyze events based on topics, locations, and dates. So, let’s try to write some queries.

Simple SQL query

Let’s first try to get the top 10 results from the table.

Python
 
     results=client.query("""
        SELECT Name,Description FROM default.Events LIMIT 10
    """)
for row in results.named_results():
        print(row["Name"])
        print(row['Description'])


This query will simply return the top 10 results from the events table.

Discover Events by Semantic Relevance

Let’s try to find the top 10 upcoming events with a vibe similar to a reference event, like this: “One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!! Our Schenectady”. This is achieved by comparing semantic embeddings of event descriptions, ensuring a match in themes and emotions.

Python
 
embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embedding=embeddings[0].embedding
results = client.query(f"""
        SELECT Name, Description,
        distance(Description_Embeddings, {embedding}) as dist FROM default.Events ORDER BY dist LIMIT 10
    """)
for row in results.named_results():
        print("Title of the event  ", row["Name"])
        print("Description of the event  ", row['Description'])
        print("Distance : ", row["dist"])


Trending Events by Popularity

This query ranks the top 10 events by the number of attendees and interested users, highlighting popular events from large city festivals to major conferences. It is ideal for those seeking to join large, energetic gatherings.

Python
 
     results = client.query(f"""
        SELECT Name, City, Users_Going, Users_Interested, Users_Responded
        FROM default.Events
        ORDER BY Users_Going DESC, Users_Interested DESC
        LIMIT 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Name"])
        print("City ", row["City"])
        print("Users Going ", row["Users_Going"])
        print("Interested Users ", row["Users_Interested"])


Popular Local Events in New York

Combining relevance and popularity, this query identifies similar events in New York City related to a specific event and ranks them by attendance, offering a curated list of events that reflect the city's vibrant culture and attract local interest.

Python
 
     embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embeddi=embeddings[0].embedding
results = client.query(f"""
        SELECT Name,City, Description, Users_Going,distance(Description_Embeddings, {embeddi}) as dist
        FROM default.Events
        WHERE City LIKE '%New York%' and dist < 1.5
        ORDER BY Users_Going DESC,dist
        LIMIT 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Name"])
        print("Description ", row["Description"])
        print("Users Going ", row["Users_Going"])


Leading Event Organizers

This query ranks the top 10 event organizers by the total number of attendees and interested users, highlighting those who excel in creating compelling events and attracting large audiences. It provides insights for event planners and attendees interested in top-tier events.

Python
 
     # Which client has attracted the most number of users 
results = client.query(f"""
       SELECT Organized_By, SUM(Users_Going + Users_Interested) AS Total_Users
        FROM default.Events
        GROUP BY Organized_By
        ORDER BY Total_Users DESC
        Limit 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Organized_By"])
        print("Total_Users ", row["Total_Users"])


Implement RAG

Previously, we have explored MyScale for data analysis, highlighting its capabilities in enhancing our data workflows. Moving forward, we'll go one step ahead by implementing Retrieval-Augmented Generation (RAG), an innovative framework combining an external knowledge base with LLMs. This step will help you to understand your data better and find more detailed insights. Next, you'll see how to use RAG with MyScale, which will make working with data more interesting and productive.

Python
 
     from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

query="Can you please suggest me some events related to basketball"
# Use the get_embedding method defined above, it accepts a list of sentences
embeddings=get_embedding([query])
embeddings=embeddings[0].embedding
results = client.query(f"""
        SELECT Name, City, Users_Going, Description, distance(Description_Embeddings, {embeddings}) as dist
        FROM default.Events
        ORDER BY Users_Going DESC,dist
        LIMIT 5
    """)
PROMPT_TEMPLATE = """
Your objective is to formulate a response to a question using only the information provided below:
{context}
---
Your task is to carefully analyze the provided context and provide an answer to the following question based solely on the information given:
{question}
"""
# Combine the descriptions of the top results. 
descriptions = [row["Description"] for row in results.named_results()]
context_text = "\n\n---\n\n".join(descriptions)
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query)
model = ChatOpenAI(openai_api_key="your_api_key_here")
response_text = model.predict(prompt)
formatted_response = f"Response: {response_text}\n"
print(formatted_response)


Throughout this blog, we have observed that MyScale is much more than a vector database that can be used to develop all kinds of applications. We can use it as a simple SQL database or for advanced AI applications, covering the majority of the development domain. We encourage you to give it a try and explore the advanced features by signing up for the free tier and getting 5 million free vector storage.

Conclusion

We have explored the abilities and functionalities of MyScale with Apify Scraper through the process of developing an event analytics application. MyScale has demonstrated its exceptional capabilities in high-performance vector search while retaining all the functionalities of SQL databases, which helps developers make semantic searches using familiar SQL syntax with much better speed and accuracy.

The capabilities of MyScale are not limited to this application: you can adopt it to develop any AI applications using the RAG method. 

If you have any feedback or suggestions, please reach out to us.

Semantic search Data (computing) facebook Pandas

Published at DZone with permission of Usama Jamil. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • BigQuery DataFrames in Python
  • TAO: A Comprehensive Look at Facebook's Distributed Data Store
  • Top 5 Incidents and Outages of 2021
  • Playing With Pandas DataFrames (With Missing Values Table Example)

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: