Performing Advanced Facebook Event Data Analysis With a Vector Database

In this post, discover a comprehensive application designed to scrape event data from Facebook and analyze the scraped data using MyScale.

Usama Jamil

Mar. 08, 24 · Tutorial

Like (2)

Save

2.8K Views

In today's digital age, professionals across all industries must stay updated with upcoming events, conferences, and workshops. However, efficiently finding events that align with one's interests amidst the vast ocean of online information presents a significant challenge.

This blog introduces an innovative solution to this challenge: a comprehensive application designed to scrape event data from Facebook and analyze the scraped data using MyScale. While MyScale is commonly associated with the RAG tech stack or used as a vector database, its capabilities extend beyond these realms. We will utilize it for data analysis, leveraging its vector search functionality to analyze events that are semantically similar, thus providing better results and insights.

You may notice that Grok AI utilized the Qdrant vector database as the search engine to retrieve real-time information from X (formerly known as Twitter) data. You can also assess the power of vector databases in this way with MyScale by integrating MyScale with other platforms like Apify to enhance daily life tasks through the development of simple personalized applications.

So in this blog, let’s develop an application that takes only the name of a city as input and scrapes all related events from Facebook. Subsequently, we will conduct data analysis and semantic search using the advanced SQL vector capabilities of MyScale.

Tools and Technologies

We’ll use several tools, including Apify, MyScale, and OpenAI, to develop this useful application.

Apify: A popular web scraping and automation platform that significantly streamlines the process of data collection. It provides the capability to scrape data and subsequently feed it to LLMs. This allows us to train LLMs on real-time data and develop applications.
MyScale: MyScale is a SQL vector database that we use to store and process both structured and unstructured data in an optimized way.
OpenAI: We will use the model text-embedding-3-small from OpenAI to get the embeddings of the text and then save those embeddings in MyScale for data analysis and semantic search.

How To Set Up MyScale and Apify

To start setting up MyScale and Apify, you'll need to create a new directory and a Python file. You can do this by opening your terminal or command line and entering the following commands:

     Shell
    
    mkdir MyScale
cd MyScale
touch main.ipynb

Let's install the packages. Copy the command below, and paste it into your terminal. These packages will provide the tools and libraries we need to develop our application.

     Shell
    
    pip install openai apify-client clickhouse-connect pandas numpy

This should install all the dependencies in your system. To confirm that everything is installed properly, you can enter the following command in your terminal.

     Shell
    
    pip freeze | egrep '(openai|apify-client|clickhouse-connect|pandas|numpy)'

This should include all the installed dependencies with their versions. If you spot any missing dependencies, you may need to re-run the installation command for that specific package. Now, we're ready to write our code after the installations.

Note: We will be working in a Python notebook. Consider every code block a notebook cell.

How to Scrape Data With Apify

Now, we will use the Apify API to scrape event data of New York City using Facebook Events scraper .

     Python
    
 

         import pandas as pd
from apify_client import ApifyClient
# Initialize the ApifyClient with your API token
client = ApifyClient("Enter_your_apify_key_here")

# Prepare the Actor input
run_input = {
    "searchQueries": ["Sport New York"],
    "startUrls": [],
    "maxEvents": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("UZBnerCFBo5FgGouO").call(run_input=run_input)

df_columns = ['Name', 'Datetime', 'Description', 'Users_Going', 'Users_Interested', 'Users_Responded', 'City', 'Organized_By', 'Street_Address']
dataframe1 = pd.DataFrame(columns=df_columns)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Use a dictionary comprehension to replace None values with an empty string
    row = {
        'Name': item.get('name', ''),
        'Datetime': item.get('dateTimeSentence', ''),
        'Description': item.get('description', ''),
        'Users_Going': item.get('usersGoing', ''),
        'Users_Interested': item.get('usersInterested', ''),
        'Users_Responded': item.get('usersResponded', ''),
        'City': item.get('location', {}).get('city', '') if item.get('location') else '',
        'Organized_By': item.get('organizedBy', ''),
        'Street_Address': item.get('location', {}).get('streetAddress', '') if item.get('location') else ''
    }
    # Ensure all None values are replaced by an empty string
    row = {k: (v if v is not None else '') for k, v in row.items()}
    dataframe1 = dataframe1._append(row, ignore_index=True)

# Cleaning the data
dataframe1['Description'] = dataframe1['Description'].replace('\\n', '', regex=True)
   

This script gives us the upcoming event details in the form of a pandas DataFrame.

Note: Don’t forget to add your Apify API key in the script above. You can find your API token on the Integrations page in the Apify Console.

Data Pre-Processing

When we gather raw data, it comes in various formats. In this script, we will bring the event dates into a single format so that our data filtering can be more efficiently done.

     Python
    
 

         # Import necessary libraries for data manipulation and date parsing
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil import parser

# Function to parse date strings that may represent a range or a single date
def parse_dates(date_str):
    # Check if the date string contains a dash, indicating a range
    if '-' in date_str:
        parts = date_str.split('-')
        # If the string splits into two parts, it's a valid range
        if len(parts) == 2:
            try:
                # Parse start and end dates, formatting them to a readable format
                start_date = parser.parse(parts[0], fuzzy=True).strftime('%a, %b %d')
                end_date = parser.parse(parts[1], fuzzy=True).strftime('%a, %b %d')
                return start_date, end_date
            except ValueError:
                # In case of parsing error, do nothing (will handle below)
                pass  
    # If not a range or if parsing the range failed, try parsing as a single date
    try:
        parsed_date = parser.parse(date_str, fuzzy=True)
        # Format the single date for start_date and format differently for end_date
        start_date = parsed_date.strftime('%a, %b %d AT %I:%M %p EDT')
        end_date = parsed_date.strftime('%a, %b %d')  # Omitting time for end_date
        return start_date, end_date
    except ValueError:
        # Return NaN for both dates if parsing fails
        return np.nan, np.nan  

# Function to extract detailed date, time, and day from a date string
def extract_date_time_day(date_str):
    try:
        # Parse the date string, allowing for some flexibility in input format
        parsed_date = parser.parse(date_str, fuzzy=True)
        # Extract and format the date, time, and day parts
        date = parsed_date.strftime('%Y-%m-%d')
        day = parsed_date.strftime('%a')
        # Determine if the original string included a time component
        time_component = parsed_date.strftime('%I:%M %p') not in date_str
        time = parsed_date.strftime('%H:%M:%S') if not time_component else np.nan
    except ValueError:
        # If parsing fails, set date, time, and day to NaN
        date, time, day = np.nan, np.nan, np.nan
    
    return date, time, day

# Apply parse_dates function across the dataframe, creating new columns for start and end dates
dataframe1[['Start_Date', 'End_Date']] = dataframe1.apply(lambda row: pd.Series(parse_dates(row['Datetime'])), axis=1)

# Drop rows where Start_Date is NaN, indicating parsing was unsuccessful
dataframe = dataframe1.dropna(subset=['Start_Date'])

# Apply extract_date_time_day to split the start and end dates into separate date, time, and day columns
dataframe['Start_Date'], dataframe['Start_Time'], dataframe['Start_Day'] = zip(*dataframe['Start_Date'].apply(extract_date_time_day))
dataframe['End_Date'], _, dataframe['End_Day'] = zip(*dataframe['End_Date'].apply(extract_date_time_day))

# Remove the original 'Datetime' column as it's no longer needed
dataframe=dataframe.drop(['Datetime'], axis=1)

# Convert 'Start_Date' and 'End_Date' to datetime format, extracting just the date part
dataframe['Start_Date'] = pd.to_datetime(dataframe['Start_Date']).dt.date
dataframe['End_Date'] = pd.to_datetime(dataframe['End_Date']).dt.date

# Convert 'Start_Time' to datetime format, retaining the time information
dataframe['Start_Time'] = pd.to_datetime(dataframe['Start_Time'])
   

This code snippet uses pandas with datetime and dateutil packages of Python to format the data.

Generating Embeddings

To deeply understand and search events, we will generate embeddings from their descriptions using the text-embedding-3-small. These embeddings capture the semantic essence of each event, helping the application to return better results.

     Python
    
 

         # Import OpenAI library for API access.
from openai import OpenAI 
# Initialize OpenAI client with an API key.
openai_client = OpenAI(api_key="your_openai_api_key_here")
# Function to get text embeddings 
def get_embedding(text, model="text-embedding-3-small"):
    return openai_client.embeddings.create(input=text, model=model).data
embeddings = get_embedding(dataframe["Description"].tolist())
# Extract embedding vectors from the embeddings object
vectors = [embedding.embedding for embedding in embeddings]
array = np.array(vectors)
embeddings_series = pd.Series(list(array))
# Add embeddings as a new column in the DataFrame.
dataframe['Description_Embeddings'] = embeddings_series
   

Now, we will insert the new DataFrame with embeddings into MyScale.

Connecting With MyScale

As discussed at the start, we will use the MyScale as a vector database for storing and managing data. Here, we will connect to MyScale in preparation for data storage.

     Python
    
 

         import clickhouse_connect
client = clickhouse_connect.get_client(
    host='host_name_here',
    port=443,
    username='username_here',
    password='passwd_here'
)
   

This connection setup ensures our application can communicate with MyScale, and use SQL’s power for data manipulation and analysis.

Note: See Connection Details for more information on how to connect to the MyScale cluster.

Create Tables and Indexes Using MyScale

We now create a table according to our DataFrame. All the data will be stored in this table, including the embeddings.

     Python
    
 

         client.command("""
    CREATE TABLE default.Events (
    Name String,
    Description String,
    Users_Going Int64,
    Users_Interested Int64,
    Users_Responded Int64,
    City String,
    Organized_By String,
    Street_Address String,
    Start_Date Date32,
    End_Date Nullable(Date32),
    Start_Time Nullable(DateTime64),
    Start_Day String,
    End_Day String,
    Description_Embeddings Array(Float32),
    CONSTRAINT check_data_length CHECK length(Description_Embeddings) = 1536
    ) ENGINE = MergeTree()
    ORDER BY (Name);
    """)
   

The above SQL statements create a table named Events on the cluster. The CONSTRAINT makes sure that all the vectors embedding are of the same length 1536.

Storing the Data and Creating an Index in MyScale

In this step, we insert the processed data into MyScale. This involves batch-inserting the data to ensure efficient storage and retrieval.

     Python
    
 

         batch_size = 10  # Adjust based on your needs

num_batches = len(dataframe) // batch_size

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = start_idx + batch_size
    batch_data = dataframe[start_idx:end_idx]
    # print(batch_data["Description_Embeddings"])
    client.insert("default.Events", batch_data.to_records(index=False).tolist(), column_names=batch_data.columns.tolist())
    print(f"Batch {i+1}/{num_batches} inserted.")

client.command("""
ALTER TABLE default.Events
    ADD VECTOR INDEX vector_index Description_Embeddings
    TYPE MSTG
""")
   

Using pandas, the above code efficiently transfers our prepared dataset into the MyScale database.

Data Analysis Using MyScale

Finally, we use MyScale’s analytical capabilities to perform analysis and enable semantic search. By executing SQL queries, we can analyze events based on topics, locations, and dates. So, let’s try to write some queries.

Simple SQL query

Let’s first try to get the top 10 results from the table.

     Python
    
 

         results=client.query("""
        SELECT Name,Description FROM default.Events LIMIT 10
    """)
for row in results.named_results():
        print(row["Name"])
        print(row['Description'])
   

This query will simply return the top 10 results from the events table.

Discover Events by Semantic Relevance

Let’s try to find the top 10 upcoming events with a vibe similar to a reference event, like this: “One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!! Our Schenectady”. This is achieved by comparing semantic embeddings of event descriptions, ensuring a match in themes and emotions.

     Python
    
 

    embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embedding=embeddings[0].embedding
results = client.query(f"""
        SELECT Name, Description,
        distance(Description_Embeddings, {embedding}) as dist FROM default.Events ORDER BY dist LIMIT 10
    """)
for row in results.named_results():
        print("Title of the event  ", row["Name"])
        print("Description of the event  ", row['Description'])
        print("Distance : ", row["dist"])
   

Trending Events by Popularity

This query ranks the top 10 events by the number of attendees and interested users, highlighting popular events from large city festivals to major conferences. It is ideal for those seeking to join large, energetic gatherings.

     Python
    
 

         results = client.query(f"""
        SELECT Name, City, Users_Going, Users_Interested, Users_Responded
        FROM default.Events
        ORDER BY Users_Going DESC, Users_Interested DESC
        LIMIT 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Name"])
        print("City ", row["City"])
        print("Users Going ", row["Users_Going"])
        print("Interested Users ", row["Users_Interested"])
   

Popular Local Events in New York

Combining relevance and popularity, this query identifies similar events in New York City related to a specific event and ranks them by attendance, offering a curated list of events that reflect the city's vibrant culture and attract local interest.

     Python
    
 

         embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embeddi=embeddings[0].embedding
results = client.query(f"""
        SELECT Name,City, Description, Users_Going,distance(Description_Embeddings, {embeddi}) as dist
        FROM default.Events
        WHERE City LIKE '%New York%' and dist < 1.5
        ORDER BY Users_Going DESC,dist
        LIMIT 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Name"])
        print("Description ", row["Description"])
        print("Users Going ", row["Users_Going"])
   

Leading Event Organizers

This query ranks the top 10 event organizers by the total number of attendees and interested users, highlighting those who excel in creating compelling events and attracting large audiences. It provides insights for event planners and attendees interested in top-tier events.

     Python
    
 

         # Which client has attracted the most number of users 
results = client.query(f"""
       SELECT Organized_By, SUM(Users_Going + Users_Interested) AS Total_Users
        FROM default.Events
        GROUP BY Organized_By
        ORDER BY Total_Users DESC
        Limit 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Organized_By"])
        print("Total_Users ", row["Total_Users"])
   

Implement RAG

Previously, we have explored MyScale for data analysis, highlighting its capabilities in enhancing our data workflows. Moving forward, we'll go one step ahead by implementing Retrieval-Augmented Generation (RAG), an innovative framework combining an external knowledge base with LLMs. This step will help you to understand your data better and find more detailed insights. Next, you'll see how to use RAG with MyScale, which will make working with data more interesting and productive.

     Python
    
 

         from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

query="Can you please suggest me some events related to basketball"
# Use the get_embedding method defined above, it accepts a list of sentences
embeddings=get_embedding([query])
embeddings=embeddings[0].embedding
results = client.query(f"""
        SELECT Name, City, Users_Going, Description, distance(Description_Embeddings, {embeddings}) as dist
        FROM default.Events
        ORDER BY Users_Going DESC,dist
        LIMIT 5
    """)
PROMPT_TEMPLATE = """
Your objective is to formulate a response to a question using only the information provided below:
{context}
---
Your task is to carefully analyze the provided context and provide an answer to the following question based solely on the information given:
{question}
"""
# Combine the descriptions of the top results. 
descriptions = [row["Description"] for row in results.named_results()]
context_text = "\n\n---\n\n".join(descriptions)
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query)
model = ChatOpenAI(openai_api_key="your_api_key_here")
response_text = model.predict(prompt)
formatted_response = f"Response: {response_text}\n"
print(formatted_response)
   

Throughout this blog, we have observed that MyScale is much more than a vector database that can be used to develop all kinds of applications. We can use it as a simple SQL database or for advanced AI applications, covering the majority of the development domain. We encourage you to give it a try and explore the advanced features by signing up for the free tier and getting 5 million free vector storage.

Conclusion

We have explored the abilities and functionalities of MyScale with Apify Scraper through the process of developing an event analytics application. MyScale has demonstrated its exceptional capabilities in high-performance vector search while retaining all the functionalities of SQL databases, which helps developers make semantic searches using familiar SQL syntax with much better speed and accuracy.

The capabilities of MyScale are not limited to this application: you can adopt it to develop any AI applications using the RAG method.

If you have any feedback or suggestions, please reach out to us.

Semantic search Data (computing) facebook Pandas

Published at DZone with permission of Usama Jamil. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending