Commit 242533d

started data engineering basics
1 parent fc20393 commit 242533d

File tree

8 files changed

+771
-0
lines changed

Lines changed: 125 additions & 0 deletions
---
title: Mastering APIs for Data Collection
sidebar_label: APIs
description: "A deep dive into REST and GraphQL APIs: how to fetch, authenticate, and process external data for machine learning."
tags: [apis, rest, graphql, json, data-engineering, python-requests]
---

In the Data Engineering lifecycle, **APIs** are the "clean" way to collect data. Unlike web scraping, which is brittle and unstructured, APIs provide a contract-based method to access data that is versioned, documented, and usually delivered in machine-readable formats like JSON.
## 1. How APIs Work: The Request-Response Cycle

An API acts as a middleman between your ML pipeline and a remote server. You send a **Request** (a specific question) and receive a **Response** (the data answer).

```mermaid
sequenceDiagram
    participant Pipeline as ML Data Pipeline
    participant API as API Gateway
    participant Server as Data Server

    Pipeline->>API: HTTP Request (GET /data)
    Note right of Pipeline: Includes Headers & API Key
    API->>Server: Validate & Route
    Server-->>API: Data Payload
    API-->>Pipeline: HTTP Response (200 OK + JSON)
```
### Components of an API Request

1. **Endpoint (URL):** The address where the data lives (e.g., `api.twitter.com/v2/tweets`).
2. **Method:** What you want to do (`GET` to fetch, `POST` to send).
3. **Headers:** Metadata such as your **API Key** or the format you want (`Content-Type: application/json`).
4. **Parameters:** Filters for the data (e.g., `?start_date=2023-01-01`).
## 2. Common API Architectures in ML

### A. REST (Representational State Transfer)

The most common architecture. It treats every piece of data as a "Resource."

* **Best for:** Standardized data fetching.
* **Format:** Almost exclusively **JSON**.

### B. GraphQL

Developed by Meta, GraphQL allows the client to define the exact structure of the data it needs.

* **Advantage in ML:** If a user profile has 100 fields but you only need 3 features for your model, GraphQL prevents "over-fetching," saving bandwidth and memory.
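As a sketch of how that looks in practice: a GraphQL request is a single POST whose body names exactly the fields you want. The endpoint and field names below are hypothetical, not from any real schema.

```python
import json

# Hypothetical query: the model needs only 3 of the profile's ~100 fields.
query = """
query GetUserFeatures($id: ID!) {
  user(id: $id) {
    age
    country
    accountTier
  }
}
"""

# The whole request is one JSON body; in practice you would send it with
# requests.post(url, json=payload, headers={"Authorization": "Bearer ..."}).
payload = {"query": query, "variables": {"id": "42"}}
print(sorted(payload))  # ['query', 'variables']
```

The server responds with a JSON object mirroring the query shape, so the three requested fields are the only ones that cross the wire.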
[Image comparing REST vs GraphQL data fetching efficiency]

### C. Streaming APIs (WebSockets/gRPC)

Used when data needs to be delivered in real time.

* **ML Use Case:** Algorithmic trading or live social media sentiment monitoring.
## 3. Implementation in Python

The `requests` library is the standard tool for interacting with APIs.

```python
import requests

url = "https://api.example.com/v1/weather"
headers = {
    "Authorization": "Bearer YOUR_TOKEN"
}
params = {
    "city": "Mandsaur",
    "country": "IN",
    "units": "metric"
}

# A timeout prevents the pipeline from hanging forever on a dead endpoint
response = requests.get(url, headers=headers, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()
    temperature = data["main"]["temp"]   # Extracting temperature
    humidity = data["main"]["humidity"]  # Extracting humidity

    print(f"Temperature in Mandsaur: {temperature}°C")
    print(f"Humidity: {humidity}%")
else:
    print(f"Failed to fetch weather data (status {response.status_code})")
```
## 4. Challenges: Rate Limiting and Status Codes

APIs are not infinite resources. Providers implement **Rate Limiting** to prevent abuse.

| Status Code | Meaning | Action for ML Pipeline |
| --- | --- | --- |
| **200** | OK | Process the data. |
| **401** | Unauthorized | Check your API Key/Token. |
| **404** | Not Found | Check your Endpoint URL. |
| **429** | Too Many Requests | **Exponential Backoff:** Wait and try again later. |
```mermaid
flowchart TD
    Req[Send API Request] --> Res{Status Code?}
    Res -- 200 --> Save[Ingest to Database]
    Res -- 429 --> Wait[Wait/Sleep] --> Req
    Res -- 401 --> Fail[Alert Developer]
    style Wait fill:#fff3e0,stroke:#ef6c00,color:#333
```
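That retry loop can be sketched in a few lines of Python. This is a minimal exponential-backoff helper, not from any specific library; the `fetch` callable and delay constants are illustrative.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call `fetch()` until it returns 200, doubling the wait after each 429."""
    for attempt in range(max_retries):
        status, data = fetch()
        if status == 200:
            return data
        if status == 429:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            time.sleep(delay)
        else:
            raise RuntimeError(f"Unrecoverable status: {status}")
    raise RuntimeError("Rate limit never cleared")

# Simulated endpoint: rate-limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, {"ok": True})])
result = fetch_with_backoff(lambda: next(responses), base_delay=0.01)
print(result)  # {'ok': True}
```

Doubling the delay gives the provider progressively more breathing room, which is why 429 handlers prefer it over retrying at a fixed interval.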
## 5. Authentication Methods

1. **API Keys:** A simple string passed in the header.
2. **OAuth 2.0:** A more secure, token-based system used by Google, Meta, and Twitter.
3. **JWT (JSON Web Tokens):** Often used in internal microservices.
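In practice the three methods differ mainly in which header carries the credential. The header names below are common conventions (values are placeholders); always check your provider's docs.

```python
# Typical header shapes for each method (all values are placeholders).
api_key_headers = {"X-API-Key": "YOUR_API_KEY"}             # 1. API key
oauth_headers = {"Authorization": "Bearer YOUR_TOKEN"}      # 2. OAuth 2.0 access token
jwt_headers = {"Authorization": "Bearer YOUR_SIGNED_JWT"}   # 3. JWT, also Bearer-style

print(oauth_headers["Authorization"].split()[0])  # Bearer
```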
## References for More Details

* **[REST API Tutorial](https://restfulapi.net/):** Understanding the principles of RESTful design.
* **[Python Requests Guide](https://requests.readthedocs.io/en/latest/):** Mastering HTTP requests for data collection.

---

APIs give us structured data, but sometimes the "front door" is locked. When there is no API, we must use the more aggressive "side window" approach.
Lines changed: 101 additions & 0 deletions
---
title: Data Sources in ML
sidebar_label: Data Sources
description: "Identifying and integrating various data sources: from relational databases and APIs to unstructured web data and IoT streams."
tags: [data-engineering, data-sources, sql, nosql, apis, web-scraping]
---

Data is the "fuel" for Machine Learning. However, this fuel is rarely found in one place. As a data engineer, your job is to identify where the raw data lives and how to transport it safely into your environment for processing.
## 1. The Data Source Landscape

We generally categorize data sources based on their **Structure** and their **Storage Method**.

```mermaid
graph TD
    Root[Data Sources] --> Structured[Structured]
    Root --> Semi[Semi-Structured]
    Root --> Unstructured[Unstructured]

    Structured --> SQL[Relational DBs: MySQL, Postgres]
    Semi --> Files[JSON, XML, Parquet]
    Unstructured --> Media[Images, Video, Audio, PDF]
```
## 2. Common Data Sources

### A. Relational Databases (SQL)

The most common source for tabular data (customer records, transactions).

* **Protocol:** SQL (Structured Query Language).
* **Pros:** Highly reliable (ACID compliant); easy to join tables.
* **Cons:** Hard to scale horizontally; requires a fixed schema.
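As a minimal sketch of pulling features from a relational source, the stdlib's `sqlite3` can stand in for a production Postgres/MySQL server (the table name and amounts are illustrative):

```python
import sqlite3

# In-memory stand-in for a production relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 1999), (1, 550), (2, 12000)],
)

# A typical feature query: total spend per user, ready to join onto labels.
rows = conn.execute(
    "SELECT user_id, SUM(amount_cents) AS total "
    "FROM transactions GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 2549), (2, 12000)]
```

The `GROUP BY` aggregation is exactly the kind of operation that is cheap in SQL but awkward to reproduce after exporting raw rows, which is why feature engineering often starts inside the database.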
### B. NoSQL Databases

Used for high-volume, high-velocity, or non-tabular data.

* **Key-Value Stores:** Redis.
* **Document Stores:** MongoDB (stores data as JSON/BSON).
* **ML Use Case:** Storing user profiles or real-time feature stores.
### C. APIs (Application Programming Interfaces)

Used to pull data from external services like Twitter, Google Maps, or financial markets.

* **Format:** Usually **JSON**, typically served over **REST** endpoints.
* **Challenges:** Rate limiting (you can only pull so much data per hour) and authentication.
### D. Cloud Object Storage (The Data Lake)

Services like **AWS S3** or **Google Cloud Storage** act as a dumping ground for raw files before they are processed.

* **ML Use Case:** Storing millions of images for a Computer Vision model.
## 3. Batch vs. Streaming Sources

How the data arrives at your model is just as important as where it comes from.

| Feature | Batch Processing | Stream Processing |
| --- | --- | --- |
| **Source** | Databases, CSV files, Data Lakes | Kafka, Kinesis, IoT Sensors |
| **Frequency** | Hourly, Daily, Weekly | Real-time (milliseconds) |
| **Use Case** | Training a model on historical sales | Predicting fraud during a transaction |

```mermaid
flowchart LR
    S1[(Database)] -->|Batch| B[ETL Process]
    S2{{IoT Sensor}} -->|Stream| P[Real-time Pipeline]
    B --> DL[Data Lake]
    P --> DL
    style P fill:#fff3e0,stroke:#ef6c00,color:#333
```
## 4. Web Scraping & Crawling

When data isn't available via an API or database, we use scrapers (like `BeautifulSoup` or `Scrapy`) to extract information from HTML.

* **Ethics Check:** Always check a site's `robots.txt` before scraping to ensure you are legally and ethically allowed to take the data.
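The stdlib's `urllib.robotparser` automates that check. In this offline sketch a hardcoded `robots.txt` stands in for one fetched from the target site (in practice you would call `set_url(...)` and `read()` instead of `parse(...)`):

```python
from urllib.robotparser import RobotFileParser

# Stand-in for the file you would fetch from https://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

Running this check per URL before each request keeps your crawler on the right side of the site's published policy.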
## 5. Identifying High-Quality Sources

Not all data sources are equal. When evaluating a source for an ML project, ask:

1. **Freshness:** How often is this data updated?
2. **Reliability:** Does the source go down often?
3. **Completeness:** Does it have missing values?
4. **Granularity:** Is the data at the level we need (e.g., individual transactions vs. daily totals)?
## References for More Details

* **[Google Cloud - Data Source Types](https://cloud.google.com/architecture/data-lifecycle-cloud-platform):** Understanding how cloud providers handle different data types.
* **[MongoDB University](https://university.mongodb.com/):** Learning the difference between Document stores and SQL.

---

Finding the data is only the first step. Once we have access, we need to move it into our systems without losing information or causing bottlenecks.
Lines changed: 88 additions & 0 deletions
---
title: "SQL vs. NoSQL for ML"
sidebar_label: SQL & NoSQL
description: "Comparing Relational and Non-Relational databases: choosing the right storage for your machine learning features and labels."
tags: [databases, sql, nosql, data-engineering, postgres, mongodb]
---

Choosing between a **SQL (Relational)** and a **NoSQL (Non-Relational)** database is one of the most critical decisions in a Data Engineering pipeline. In Machine Learning, this choice often depends on whether your data is fixed and structured or evolving and unstructured.
## 1. The Architectural Divide

```mermaid
graph TD
    subgraph SQL ["SQL (Relational)"]
        Table[Tables/Rows] --- Schema[Strict Schema]
    end
    subgraph NoSQL ["NoSQL (Non-Relational)"]
        Doc[Documents/Key-Value] --- Flexible[Dynamic Schema]
    end
    style SQL fill:#e3f2fd,stroke:#1565c0,color:#333
    style NoSQL fill:#f1f8e9,stroke:#33691e,color:#333
```
## 2. SQL: Relational Databases

**Examples:** PostgreSQL, MySQL, SQLite, Oracle.

SQL databases store data in rows and columns. They are built on **ACID** properties (Atomicity, Consistency, Isolation, Durability), ensuring that every transaction is processed reliably.

* **Best for:** Structured data where relationships are key (e.g., linking a `User_ID` to `Transactions` and `Product_Details`).
* **Scaling:** Vertical (buying a bigger, more powerful server).
* **ML Use Case:** Serving as the "Source of Truth" for historical training data where data integrity is paramount.
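A minimal sketch of that reliability guarantee, using the stdlib's `sqlite3` (the `labels` table is illustrative): if any statement in a transaction fails, the whole transaction rolls back, so the table never holds a half-applied write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (example_id INTEGER PRIMARY KEY, label TEXT)")

# Atomicity: either both inserts commit, or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO labels VALUES (1, 'cat')")
        conn.execute("INSERT INTO labels VALUES (1, 'dog')")  # violates PRIMARY KEY
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM labels").fetchone()[0]
print(count)  # 0 -- the first insert was rolled back along with the failed one
```

This all-or-nothing behavior is exactly what makes SQL databases trustworthy as the "Source of Truth" for training labels.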
## 3. NoSQL: Non-Relational Databases

**Examples:** MongoDB (Document), Cassandra (Column-family), Redis (Key-Value), Neo4j (Graph).

NoSQL databases are designed for distributed data and high-speed horizontal scaling. They typically follow the **BASE** model (Basically Available, Soft state, Eventual consistency).

* **Best for:** Unstructured or semi-structured data (JSON, social media feeds, sensor logs).
* **Scaling:** Horizontal (adding more cheap servers to a cluster).
* **ML Use Case:**
  * **Feature Stores:** Using Redis for ultra-fast lookup of features during real-time inference.
  * **Unstructured Storage:** Using MongoDB to store raw JSON metadata for NLP tasks.
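As an offline sketch of the feature-store pattern, a plain dict stands in here for Redis (with the real client you would call `redis.Redis().set(...)` / `.get(...)`); the key format and feature names are illustrative:

```python
import json

# Dict standing in for a Redis instance: one JSON blob per entity key.
store = {}

def put_features(entity_id, features):
    store[f"features:user:{entity_id}"] = json.dumps(features)

def get_features(entity_id):
    raw = store.get(f"features:user:{entity_id}")
    return json.loads(raw) if raw else None

put_features(42, {"avg_spend": 25.4, "txn_count": 17})
print(get_features(42))  # {'avg_spend': 25.4, 'txn_count': 17}
```

The point of the pattern is the single O(1) key lookup at inference time: no joins, no query planning, just one round trip per entity.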
## 4. Key Differences Comparison

| Feature | SQL | NoSQL |
| --- | --- | --- |
| **Data Model** | Tabular (Rows/Columns) | Document, Key-Value, Graph |
| **Schema** | Fixed (Pre-defined) | Dynamic (On-the-fly) |
| **Joins** | Very efficient (`JOIN`) | Generally avoided (data is denormalized) |
| **Query Language** | Structured Query Language (SQL) | Varies (e.g., MQL for MongoDB) |
| **Standard** | ACID | BASE |
## 5. CAP Theorem: The Data Engineer's Trade-off

When choosing a database for a distributed ML system, you must consider the **CAP Theorem**. It states that a distributed system can only provide two out of the following three:

```mermaid
pie
    title CAP Theorem
    "Consistency" : 1
    "Availability" : 1
    "Partition Tolerance" : 1
```

1. **Consistency:** Every read receives the most recent write.
2. **Availability:** Every request receives a response (even if it's not the latest).
3. **Partition Tolerance:** The system continues to operate despite network failures.
## 6. Hybrid Approaches: The "Polyglot" Strategy

Modern ML architectures rarely use just one database.

* **Postgres (SQL)** might store the user accounts and labels.
* **MongoDB (NoSQL)** might store the raw log data.
* **S3 (Object Store)** might store the actual trained `.pkl` or `.onnx` model files.
## References for More Details

* **[PostgreSQL Documentation](https://www.postgresql.org/docs/):** Learning about complex joins and indexing for speed.
* **[MongoDB Architecture Guide](https://www.mongodb.com/docs/manual/core/data-modeling-introduction/):** Understanding document-based data modeling.

---

Storing data is one thing; getting it into your system is another. Let's look at how we build the bridges between these databases and our models.
