I revisited this project after a previous attempt to scrape and analyze the AliExpress website. In that attempt, I used a prebuilt Scrapfly scraper skeleton, which imposed architectural limitations and lacked robustness. The original pipeline, from Python to an SQL database, allowed me to analyze and visualize around 500 products in Tableau. My Beta code and dashboard are linked below, accessible through GitHub and Tableau Public, respectively.
The Beta attempt reshaped my approach: e-commerce websites like AliExpress are equipped with extensive anti-scraping countermeasures, so I moved to a double-scraper design. The first scraper focuses on breadth, collecting the general product data found on search pages.
The second, depth-focused scraper uses the first scraper's output to drill into key features hidden within individual product pages. This two-tiered approach lets me scrape a large quantity of products without sacrificing the data quality needed to compete with most commercial scrapers. A rough sketch of the hand-off between the two tiers appears below.
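As a minimal sketch of that hand-off (not the production code), the breadth tier emits lightweight search records that the depth tier consumes one by one. The function names and the `SearchResult` fields here are hypothetical stand-ins:

```python
# Sketch of the two-tier hand-off. Function names and SearchResult fields
# are hypothetical stand-ins for the real scrapers.
from dataclasses import dataclass

@dataclass
class SearchResult:
    product_id: str
    title: str
    price: float
    url: str  # product-page URL, consumed by the depth scraper

def breadth_scrape(query: str, pages: int) -> list[SearchResult]:
    """Tier 1: walk paginated search results, collecting general product data."""
    # Placeholder: the real implementation paginates search pages with Selenium.
    return []

def depth_scrape(result: SearchResult) -> dict:
    """Tier 2: open one product page and extract the hidden detail features."""
    # Placeholder: the real implementation parses the product-page DOM.
    return {"product_id": result.product_id}

# Breadth finds candidates cheaply; depth enriches only what breadth surfaced.
candidates = breadth_scrape("wireless earbuds", pages=5)
enriched = [depth_scrape(r) for r in candidates]
```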
The best showcase of this data pipeline lies in the Review Velocity Score, a custom metric I engineered to identify trending products before the market becomes saturated.
Standard metrics reported by commercial scrapers, such as "Total Sold," are lagging indicators. To build an alpha feature, i.e. a leading indicator, I developed a Python algorithm (sketched below the list) that:
1) Extracts unstructured review data from the product DOM.
2) Parses and normalizes inconsistent date formats.
3) Filters for activity within a dynamic 7-day window.
4) Calculates a weighted score based on review volume and rating.
[Pipeline snapshot: raw unstructured DOM (input) → processing logic → structured feature data (output)]
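Here is a minimal sketch of steps 2 through 4; step 1 (DOM extraction) is assumed to have already produced a list of `{date, rating}` dicts. The field names and the volume-times-rating weighting are illustrative assumptions, not the exact production formula:

```python
# Minimal sketch of scoring steps 2-4; the weighting is an assumption.
from datetime import datetime, timedelta

import pandas as pd

def review_velocity_score(reviews: list[dict], window_days: int = 7) -> float:
    """Score recent review activity: volume weighted by average rating."""
    df = pd.DataFrame(reviews)

    # Step 2: normalize inconsistent date strings. format="mixed" (pandas >= 2.0)
    # parses each element independently; errors="coerce" turns failures into NaT.
    df["date"] = pd.to_datetime(df["date"], format="mixed", errors="coerce")
    df = df.dropna(subset=["date"])

    # Step 3: keep only reviews inside the trailing N-day window.
    cutoff = datetime.now() - timedelta(days=window_days)
    recent = df[df["date"] >= cutoff]
    if recent.empty:
        return 0.0

    # Step 4: weight raw review volume by how favorable the recent ratings are.
    return len(recent) * (recent["rating"].mean() / 5.0)

# Example with hypothetical, mixed-format review dates:
now = datetime.now()
sample = [
    {"date": (now - timedelta(days=1)).strftime("%Y-%m-%d"), "rating": 5},
    {"date": (now - timedelta(days=3)).strftime("%B %d, %Y"), "rating": 4},
    {"date": "not a date", "rating": 3},  # dropped by the coercion step
]
print(review_velocity_score(sample))  # 2 * (4.5 / 5.0) = 1.8
```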
Developing this logic was a major technical hurdle due to the variability of AliExpress's frontend JavaScript: shadow DOM containers, modal scrolling, and drop-down menus all required robust exception handling.
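The defensive pattern might look something like the following Selenium sketch; the CSS selector and timeout values are placeholders, not AliExpress's actual markup:

```python
# Defensive Selenium pattern; the selector and timeout are placeholders.
from selenium import webdriver
from selenium.common.exceptions import (
    ElementClickInterceptedException,
    TimeoutException,
)
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def open_reviews_section(driver: webdriver.Chrome) -> bool:
    """Expand the reviews section, tolerating flaky frontend behavior."""
    try:
        tab = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "[data-pl='reviews-tab']"))
        )
        tab.click()
        return True
    except TimeoutException:
        # The tab never rendered or never became clickable; skip this product.
        return False
    except ElementClickInterceptedException:
        # A modal or overlay is covering the tab; fall back to a JS click.
        driver.execute_script(
            "arguments[0].scrollIntoView(); arguments[0].click();", tab
        )
        return True
```

Falling back to a JavaScript click when an overlay intercepts the native one is a common workaround for modal interference.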
With the alpha-feature logic validated, the next step in development is to integrate the breadth scraper so that it feeds high-volume search data into the depth scraper's scoring engine.
Once the pipeline is fully automated, I will implement a SQL warehouse to build a historical dataset. I also intend to extend the project's analytical capabilities by training a gradient-boosting model (XGBoost).
The goal is to move from descriptive to predictive analytics: forecasting which products are likely to trend based on features such as the Review Velocity Score. In the end, I will have a competitive model whose accuracy I can test through real-world market validation (A/B testing).
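A hedged sketch of that future step, assuming a hypothetical `product_snapshots` table and a binary "trended" label in the warehouse:

```python
# Sketch of the planned predictive step. The table, feature columns, and
# label are assumptions about the future warehouse schema, not built yet.
import sqlite3

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# sqlite3 stands in for whatever SQL engine the warehouse ends up using.
conn = sqlite3.connect("warehouse.db")
df = pd.read_sql("SELECT * FROM product_snapshots", conn)

features = ["review_velocity_score", "price", "rating", "total_sold"]
X = df[features]
y = df["trended_next_month"]  # hypothetical binary label: did the product trend?

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```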
Technologies: Python, Selenium, BeautifulSoup, Pandas, SQL, Tableau