Commit 242533d

started data engineering basics
1 parent fc20393 commit 242533d

File tree

8 files changed

+771
-0
lines changed

Lines changed: 125 additions & 0 deletions
---
title: Mastering APIs for Data Collection
sidebar_label: APIs
description: "A deep dive into REST and GraphQL APIs: how to fetch, authenticate, and process external data for machine learning."
tags: [apis, rest, graphql, json, data-engineering, python-requests]
---

In the Data Engineering lifecycle, **APIs** are the "clean" way to collect data. Unlike web scraping, which is brittle and unstructured, APIs provide a contract-based method to access data that is versioned, documented, and usually delivered in machine-readable formats like JSON.
## 1. How APIs Work: The Request-Response Cycle

An API acts as a middleman between your ML pipeline and a remote server. You send a **Request** (a specific question) and receive a **Response** (the data answer).

```mermaid
sequenceDiagram
    participant Pipeline as ML Data Pipeline
    participant API as API Gateway
    participant Server as Data Server

    Pipeline->>API: HTTP Request (GET /data)
    Note right of Pipeline: Includes Headers & API Key
    API->>Server: Validate & Route
    Server-->>API: Data Payload
    API-->>Pipeline: HTTP Response (200 OK + JSON)
```
### Components of an API Request

1. **Endpoint (URL):** The address where the data lives (e.g., `api.twitter.com/v2/tweets`).
2. **Method:** What you want to do (`GET` to fetch, `POST` to send).
3. **Headers:** Metadata such as your **API Key** or the format you want (`Content-Type: application/json`).
4. **Parameters:** Filters for the data (e.g., `?start_date=2023-01-01`).
## 2. Common API Architectures in ML

### A. REST (Representational State Transfer)

The most common architecture. It treats every piece of data as a "Resource."

* **Best for:** Standardized data fetching.
* **Format:** Almost exclusively **JSON**.

### B. GraphQL

Developed by Meta, GraphQL allows the client to define the exact structure of the data it needs.

* **Advantage in ML:** If a user profile has 100 fields but you only need 3 features for your model, GraphQL prevents "over-fetching," saving bandwidth and memory.
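As a sketch of how that looks in practice: a GraphQL request is a single POST whose body names exactly the fields you want. The endpoint and field names below are hypothetical, not from any real schema.

```python
import json

# Hypothetical query: the model needs only 3 of the profile's ~100 fields.
query = """
query GetUserFeatures($id: ID!) {
  user(id: $id) {
    age
    country
    accountTier
  }
}
"""

# The whole request is one JSON body; in practice you would send it with
# requests.post(url, json=payload, headers={"Authorization": "Bearer ..."}).
payload = {"query": query, "variables": {"id": "42"}}
print(sorted(payload))  # ['query', 'variables']
```

The server responds with a JSON object mirroring the query shape, so the three requested fields are the only ones that cross the wire.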
[Image comparing REST vs GraphQL data fetching efficiency]

### C. Streaming APIs (WebSockets/gRPC)

Used when data needs to be delivered in real time.

* **ML Use Case:** Algorithmic trading or live social media sentiment monitoring.
## 3. Implementation in Python

The `requests` library is the standard tool for interacting with APIs.

```python
import requests

url = "https://api.example.com/v1/weather"
headers = {
    "Authorization": "Bearer YOUR_TOKEN"
}
params = {
    "city": "Mandsaur",
    "country": "IN",
    "units": "metric"
}

# A timeout prevents the pipeline from hanging forever on a dead endpoint
response = requests.get(url, headers=headers, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()
    temperature = data["main"]["temp"]   # Extracting temperature
    humidity = data["main"]["humidity"]  # Extracting humidity

    print(f"Temperature in Mandsaur: {temperature}°C")
    print(f"Humidity: {humidity}%")
else:
    print(f"Failed to fetch weather data (status {response.status_code})")
```
## 4. Challenges: Rate Limiting and Status Codes

APIs are not infinite resources. Providers implement **Rate Limiting** to prevent abuse.

| Status Code | Meaning | Action for ML Pipeline |
| --- | --- | --- |
| **200** | OK | Process the data. |
| **401** | Unauthorized | Check your API Key/Token. |
| **404** | Not Found | Check your Endpoint URL. |
| **429** | Too Many Requests | **Exponential Backoff:** Wait and try again later. |
```mermaid
flowchart TD
    Req[Send API Request] --> Res{Status Code?}
    Res -- 200 --> Save[Ingest to Database]
    Res -- 429 --> Wait[Wait/Sleep] --> Req
    Res -- 401 --> Fail[Alert Developer]
    style Wait fill:#fff3e0,stroke:#ef6c00,color:#333
```
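That retry loop can be sketched in a few lines of Python. This is a minimal exponential-backoff helper, not from any specific library; the `fetch` callable and delay constants are illustrative.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call `fetch()` until it returns 200, doubling the wait after each 429."""
    for attempt in range(max_retries):
        status, data = fetch()
        if status == 200:
            return data
        if status == 429:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, 8s, ...
            time.sleep(delay)
        else:
            raise RuntimeError(f"Unrecoverable status: {status}")
    raise RuntimeError("Rate limit never cleared")

# Simulated endpoint: rate-limited twice, then succeeds.
responses = iter([(429, None), (429, None), (200, {"ok": True})])
result = fetch_with_backoff(lambda: next(responses), base_delay=0.01)
print(result)  # {'ok': True}
```

Doubling the delay gives the provider progressively more breathing room, which is why 429 handlers prefer it over retrying at a fixed interval.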
## 5. Authentication Methods

1. **API Keys:** A simple string passed in the header.
2. **OAuth 2.0:** A more secure, token-based system used by Google, Meta, and Twitter.
3. **JWT (JSON Web Tokens):** Often used in internal microservices.
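In practice the three methods differ mainly in which header carries the credential. The header names below are common conventions (values are placeholders); always check your provider's docs.

```python
# Typical header shapes for each method (all values are placeholders).
api_key_headers = {"X-API-Key": "YOUR_API_KEY"}             # 1. API key
oauth_headers = {"Authorization": "Bearer YOUR_TOKEN"}      # 2. OAuth 2.0 access token
jwt_headers = {"Authorization": "Bearer YOUR_SIGNED_JWT"}   # 3. JWT, also Bearer-style

print(oauth_headers["Authorization"].split()[0])  # Bearer
```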
## References for More Details

* **[REST API Tutorial](https://restfulapi.net/):** Understanding the principles of RESTful design.
* **[Python Requests Guide](https://requests.readthedocs.io/en/latest/):** Mastering HTTP requests for data collection.

---

APIs give us structured data, but sometimes the "front door" is locked. When there is no API, we must use the more aggressive "side window" approach.
Lines changed: 101 additions & 0 deletions
---
title: Data Sources in ML
sidebar_label: Data Sources
description: "Identifying and integrating various data sources: from relational databases and APIs to unstructured web data and IoT streams."
tags: [data-engineering, data-sources, sql, nosql, apis, web-scraping]
---

Data is the "fuel" for Machine Learning. However, this fuel is rarely found in one place. As a data engineer, your job is to identify where the raw data lives and how to transport it safely into your environment for processing.
## 1. The Data Source Landscape

We generally categorize data sources based on their **Structure** and their **Storage Method**.

```mermaid
graph TD
    Root[Data Sources] --> Structured[Structured]
    Root --> Semi[Semi-Structured]
    Root --> Unstructured[Unstructured]

    Structured --> SQL[Relational DBs: MySQL, Postgres]
    Semi --> Files[JSON, XML, Parquet]
    Unstructured --> Media[Images, Video, Audio, PDF]
```
## 2. Common Data Sources

### A. Relational Databases (SQL)

The most common source for tabular data (customer records, transactions).

* **Protocol:** SQL (Structured Query Language).
* **Pros:** Highly reliable (ACID compliant); easy to join tables.
* **Cons:** Hard to scale horizontally; requires a fixed schema.
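As a minimal sketch of pulling features from a relational source, the stdlib's `sqlite3` can stand in for a production Postgres/MySQL server (the table name and amounts are illustrative):

```python
import sqlite3

# In-memory stand-in for a production relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 1999), (1, 550), (2, 12000)],
)

# A typical feature query: total spend per user, ready to join onto labels.
rows = conn.execute(
    "SELECT user_id, SUM(amount_cents) AS total "
    "FROM transactions GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 2549), (2, 12000)]
```

The `GROUP BY` aggregation is exactly the kind of operation that is cheap in SQL but awkward to reproduce after exporting raw rows, which is why feature engineering often starts inside the database.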
### B. NoSQL Databases

Used for high-volume, high-velocity, or non-tabular data.

* **Key-Value Stores:** Redis.
* **Document Stores:** MongoDB (stores data as JSON/BSON).
* **ML Use Case:** Storing user profiles or real-time feature stores.
### C. APIs (Application Programming Interfaces)

Used to pull data from external services like Twitter, Google Maps, or financial markets.

* **Format:** Usually **JSON**, typically served over **REST** endpoints.
* **Challenges:** Rate limiting (you can only pull so much data per hour) and authentication.
### D. Cloud Object Storage (The Data Lake)

Services like **AWS S3** or **Google Cloud Storage** act as a dumping ground for raw files before they are processed.

* **ML Use Case:** Storing millions of images for a Computer Vision model.
## 3. Batch vs. Streaming Sources

How the data arrives at your model is just as important as where it comes from.

| Feature | Batch Processing | Stream Processing |
| --- | --- | --- |
| **Source** | Databases, CSV files, Data Lakes | Kafka, Kinesis, IoT Sensors |
| **Frequency** | Hourly, Daily, Weekly | Real-time (milliseconds) |
| **Use Case** | Training a model on historical sales | Predicting fraud during a transaction |

```mermaid
flowchart LR
    S1[(Database)] -->|Batch| B[ETL Process]
    S2{{IoT Sensor}} -->|Stream| P[Real-time Pipeline]
    B --> DL[Data Lake]
    P --> DL
    style P fill:#fff3e0,stroke:#ef6c00,color:#333
```
## 4. Web Scraping & Crawling

When data isn't available via an API or database, we use scrapers (like `BeautifulSoup` or `Scrapy`) to extract information from HTML.

* **Ethics Check:** Always check a site's `robots.txt` before scraping to ensure you are legally and ethically allowed to take the data.
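The stdlib's `urllib.robotparser` automates that check. In this offline sketch a hardcoded `robots.txt` stands in for one fetched from the target site (in practice you would call `set_url(...)` and `read()` instead of `parse(...)`):

```python
from urllib.robotparser import RobotFileParser

# Stand-in for the file you would fetch from https://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

Running this check per URL before each request keeps your crawler on the right side of the site's published policy.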
## 5. Identifying High-Quality Sources

Not all data sources are equal. When evaluating a source for an ML project, ask:

1. **Freshness:** How often is this data updated?
2. **Reliability:** Does the source go down often?
3. **Completeness:** Does it have missing values?
4. **Granularity:** Is the data at the level we need (e.g., individual transactions vs. daily totals)?
## References for More Details

* **[Google Cloud - Data Source Types](https://cloud.google.com/architecture/data-lifecycle-cloud-platform):** Understanding how cloud providers handle different data types.
* **[MongoDB University](https://university.mongodb.com/):** Learning the difference between Document stores and SQL.

---

Finding the data is only the first step. Once we have access, we need to move it into our systems without losing information or causing bottlenecks.
Lines changed: 88 additions & 0 deletions
---
title: "SQL vs. NoSQL for ML"
sidebar_label: SQL & NoSQL
description: "Comparing Relational and Non-Relational databases: choosing the right storage for your machine learning features and labels."
tags: [databases, sql, nosql, data-engineering, postgres, mongodb]
---

Choosing between a **SQL (Relational)** and a **NoSQL (Non-Relational)** database is one of the most critical decisions in a Data Engineering pipeline. In Machine Learning, this choice often depends on whether your data is fixed and structured or evolving and unstructured.
## 1. The Architectural Divide

```mermaid
graph TD
    subgraph SQL ["SQL (Relational)"]
        Table[Tables/Rows] --- Schema[Strict Schema]
    end
    subgraph NoSQL ["NoSQL (Non-Relational)"]
        Doc[Documents/Key-Value] --- Flexible[Dynamic Schema]
    end
    style SQL fill:#e3f2fd,stroke:#1565c0,color:#333
    style NoSQL fill:#f1f8e9,stroke:#33691e,color:#333
```
## 2. SQL: Relational Databases

**Examples:** PostgreSQL, MySQL, SQLite, Oracle.

SQL databases store data in rows and columns. They are built on **ACID** properties (Atomicity, Consistency, Isolation, Durability), ensuring that every transaction is processed reliably.

* **Best for:** Structured data where relationships are key (e.g., linking a `User_ID` to `Transactions` and `Product_Details`).
* **Scaling:** Vertical (buying a bigger, more powerful server).
* **ML Use Case:** Serving as the "Source of Truth" for historical training data where data integrity is paramount.
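A minimal sketch of that reliability guarantee, using the stdlib's `sqlite3` (the `labels` table is illustrative): if any statement in a transaction fails, the whole transaction rolls back, so the table never holds a half-applied write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (example_id INTEGER PRIMARY KEY, label TEXT)")

# Atomicity: either both inserts commit, or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO labels VALUES (1, 'cat')")
        conn.execute("INSERT INTO labels VALUES (1, 'dog')")  # violates PRIMARY KEY
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM labels").fetchone()[0]
print(count)  # 0 -- the first insert was rolled back along with the failed one
```

This all-or-nothing behavior is exactly what makes SQL databases trustworthy as the "Source of Truth" for training labels.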
## 3. NoSQL: Non-Relational Databases

**Examples:** MongoDB (Document), Cassandra (Column-family), Redis (Key-Value), Neo4j (Graph).

NoSQL databases are designed for distributed data and high-speed horizontal scaling. They typically follow the **BASE** model (Basically Available, Soft state, Eventual consistency).

* **Best for:** Unstructured or semi-structured data (JSON, social media feeds, sensor logs).
* **Scaling:** Horizontal (adding more cheap servers to a cluster).
* **ML Use Case:**
  * **Feature Stores:** Using Redis for ultra-fast lookup of features during real-time inference.
  * **Unstructured Storage:** Using MongoDB to store raw JSON metadata for NLP tasks.
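As an offline sketch of the feature-store pattern, a plain dict stands in here for Redis (with the real client you would call `redis.Redis().set(...)` / `.get(...)`); the key format and feature names are illustrative:

```python
import json

# Dict standing in for a Redis instance: one JSON blob per entity key.
store = {}

def put_features(entity_id, features):
    store[f"features:user:{entity_id}"] = json.dumps(features)

def get_features(entity_id):
    raw = store.get(f"features:user:{entity_id}")
    return json.loads(raw) if raw else None

put_features(42, {"avg_spend": 25.4, "txn_count": 17})
print(get_features(42))  # {'avg_spend': 25.4, 'txn_count': 17}
```

The point of the pattern is the single O(1) key lookup at inference time: no joins, no query planning, just one round trip per entity.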
## 4. Key Differences Comparison

| Feature | SQL | NoSQL |
| --- | --- | --- |
| **Data Model** | Tabular (Rows/Columns) | Document, Key-Value, Graph |
| **Schema** | Fixed (Pre-defined) | Dynamic (On-the-fly) |
| **Joins** | Very efficient (`JOIN`) | Generally avoided (data is denormalized) |
| **Query Language** | Structured Query Language (SQL) | Varies (e.g., MQL for MongoDB) |
| **Standard** | ACID | BASE |
## 5. CAP Theorem: The Data Engineer's Trade-off

When choosing a database for a distributed ML system, you must consider the **CAP Theorem**. It states that a distributed system can only provide two out of the following three:

```mermaid
pie
    title CAP Theorem
    "Consistency" : 1
    "Availability" : 1
    "Partition Tolerance" : 1
```

1. **Consistency:** Every read receives the most recent write.
2. **Availability:** Every request receives a response (even if it's not the latest).
3. **Partition Tolerance:** The system continues to operate despite network failures.
## 6. Hybrid Approaches: The "Polyglot" Strategy

Modern ML architectures rarely use just one database.

* **Postgres (SQL)** might store the user accounts and labels.
* **MongoDB (NoSQL)** might store the raw log data.
* **S3 (Object Store)** might store the actual trained `.pkl` or `.onnx` model files.
## References for More Details

* **[PostgreSQL Documentation](https://www.postgresql.org/docs/):** Learning about complex joins and indexing for speed.
* **[MongoDB Architecture Guide](https://www.mongodb.com/docs/manual/core/data-modeling-introduction/):** Understanding document-based data modeling.

---

Storing data is one thing; getting it into your system is another. Let's look at how we build the bridges between these databases and our models.
