
Your quarterly revenue dashboard is pulling from six different systems, and three of them don't offer direct database access. Instead, you're stuck copying data from web portals and wrestling with CSV exports that break every time the vendor updates their interface. Meanwhile, your competitor is releasing market analysis reports twice as fast as your team, and you suspect they've automated their data collection from the same public APIs you're manually checking each week.
This scenario plays out in organizations everywhere. While Power Query excels at connecting to databases and files, the real competitive advantage comes from automating data collection from web sources—APIs that provide real-time market data, customer portals that require authentication, and dynamic web pages that generate reports on demand. Mastering these techniques transforms you from a data processor into a data intelligence architect.
By the end of this lesson, you'll be building robust, automated pipelines that pull data from any web source, handle authentication complexities, and maintain themselves even when APIs change their structure.
Prerequisites: You should be comfortable with Power Query's basic data transformation operations and M language syntax. Experience with HTTP concepts (headers, status codes, authentication) and a basic understanding of JSON/XML structures will be helpful but not required.
Power Query approaches web data through two primary connectors: Web.Contents() for APIs and structured web requests, and Web.Page() for scraping HTML content. While these might seem similar, they operate at different levels of the web stack and require distinct strategies.
Web.Contents() works at the HTTP protocol level, sending requests directly to endpoints and receiving structured responses. This makes it ideal for REST APIs, web services, and any scenario where you're requesting data rather than presentation markup. Web.Page() processes complete HTML documents, parsing them into navigable table structures—perfect for extracting data from human-readable web pages.
The architectural distinction matters because it determines your error handling strategy, refresh behavior, and performance characteristics. API connections typically fail fast with specific error codes, while web scraping fails gradually as page structures change.
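To make the contrast concrete, here is a minimal sketch of both connectors side by side (the URLs are placeholders):

```m
let
    // Web.Contents: raw HTTP request; the response is parsed explicitly as JSON
    ApiData = Json.Document(Web.Contents("https://api.example.com/v1/metrics")),
    // Web.Page: the full HTML document, parsed into a table of detected tables
    PageTables = Web.Page(Web.Contents("https://example.com/report"))
in
    [Api = ApiData, Page = PageTables]
```

Note that `Web.Page` still takes `Web.Contents` as its input; the difference is that it interprets the bytes as an HTML document rather than handing them to you directly.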
Let's start with a real-world scenario: connecting to the GitHub API to analyze repository activity across your organization. This example demonstrates core concepts without requiring authentication initially.
let
    // Define the base URL and endpoint
    BaseUrl = "https://api.github.com",
    Endpoint = "/repos/microsoft/powerquery-m/commits",
    // Make the initial request
    Source = Json.Document(Web.Contents(BaseUrl & Endpoint)),
    // Convert the JSON list to a single-column table
    CommitsTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Expand the nested record fields
    ExpandedCommits = Table.ExpandRecordColumn(CommitsTable, "Column1",
        {"sha", "commit", "author", "committer"})
in
    ExpandedCommits
This basic pattern—request, convert, expand—forms the foundation of most API connections. But real-world APIs require more sophistication.
APIs rarely accept bare requests. They expect parameters, authentication headers, and specific content types. Here's how to structure these properly:
let
    // API configuration
    BaseUrl = "https://api.github.com",
    Repository = "microsoft/powerquery-m",
    // Build query parameters (values must be text for Uri.BuildQueryString)
    QueryParams = [
        #"since" = "2024-01-01T00:00:00Z",
        #"per_page" = "100",
        #"page" = "1"
    ],
    // Construct headers
    Headers = [
        #"Accept" = "application/vnd.github.v3+json",
        #"User-Agent" = "PowerQuery-DataPipeline/1.0"
    ],
    // Build the complete URL with parameters
    Url = BaseUrl & "/repos/" & Repository & "/commits?" &
        Uri.BuildQueryString(QueryParams),
    // Make the request with headers
    Source = Json.Document(Web.Contents(Url, [Headers = Headers])),
    // Process the response
    ProcessedData = Table.FromList(Source,
        Splitter.SplitByNothing(), null, null, ExtraValues.Error)
in
    ProcessedData
Performance Tip: Power Query caches Web.Contents() responses based on the complete URL and headers. Slight variations in parameter formatting can break caching, so standardize your URL construction patterns.
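One reliable way to standardize construction is to keep the base URL fixed and push all variability into Web.Contents' RelativePath and Query options, as in this sketch (reusing the GitHub endpoint from the earlier example):

```m
let
    // A fixed base URL plus RelativePath/Query yields consistent request URLs
    // (better caching) and lets the Power BI service validate the static base
    // address during scheduled refresh
    GetCommits = (repo as text, page as number) =>
        Json.Document(Web.Contents(
            "https://api.github.com",
            [
                RelativePath = "repos/" & repo & "/commits",
                Query = [per_page = "100", page = Number.ToText(page)]
            ]
        )),
    Source = GetCommits("microsoft/powerquery-m", 1)
in
    Source
```

Query values must be text; Power Query URL-encodes them for you, which removes another source of accidental URL variation.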
Most business APIs require authentication. Power Query supports several methods, each with specific implementation patterns.
The simplest approach uses API keys in headers or query parameters:
let
    // Store sensitive data in parameters
    ApiKey = "your-api-key-here",  // In production, use a Power Query parameter
    Headers = [
        #"Authorization" = "Bearer " & ApiKey,
        #"Content-Type" = "application/json"
    ],
    Source = Json.Document(Web.Contents(
        "https://api.example.com/data",
        [Headers = Headers]
    ))
in
    Source
Security Warning: Never hardcode API keys in queries. Use Power Query parameters and configure them at the dataset level to keep credentials secure.
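The parameterized version is a small change. In this sketch, #"API Key" is assumed to be a text parameter you have defined via Manage Parameters; no secret ever appears in the query text itself:

```m
let
    // #"API Key" is a Power Query parameter (assumed name), configured at the
    // dataset level rather than hardcoded in the query
    Headers = [
        #"Authorization" = "Bearer " & #"API Key",
        #"Content-Type" = "application/json"
    ],
    Source = Json.Document(Web.Contents(
        "https://api.example.com/data",
        [Headers = Headers]
    ))
in
    Source
```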
OAuth requires a more complex approach since Power Query can't handle interactive authentication flows directly:
let
    // Pre-obtained access token (from your OAuth flow)
    AccessToken = "your-oauth-token",
    Headers = [
        #"Authorization" = "Bearer " & AccessToken,
        #"Accept" = "application/json"
    ],
    // Function to refresh the token when it has expired
    RefreshToken = (refresh_token as text) =>
        let
            TokenUrl = "https://oauth.example.com/token",
            Body = "grant_type=refresh_token&refresh_token=" & refresh_token,
            Response = Json.Document(Web.Contents(TokenUrl, [
                Headers = [#"Content-Type" = "application/x-www-form-urlencoded"],
                Content = Text.ToBinary(Body)
            ])),
            NewToken = Response[access_token]
        in
            NewToken,
    // Main request; if it fails, refresh the token and retry once
    Source = try Json.Document(Web.Contents(
        "https://api.example.com/data",
        [Headers = Headers]
    )) otherwise
        let
            NewToken = RefreshToken("your-refresh-token"),
            RetryHeaders = [
                #"Authorization" = "Bearer " & NewToken,
                #"Accept" = "application/json"
            ]
        in
            Json.Document(Web.Contents(
                "https://api.example.com/data",
                [Headers = RetryHeaders]
            ))
in
    Source
For production OAuth implementations, consider building a custom connector or using Azure Function proxies to manage token lifecycle outside Power Query.
APIs limit response sizes through pagination. Power Query's functional approach makes building pagination logic elegant but requires understanding recursive patterns.
Here's a robust pagination implementation for APIs that use page-based pagination:
let
    // Pagination function
    GetAllPages = (baseUrl as text, initialPage as number) =>
        let
            // Recursive function to fetch pages
            GetPage = (pageNumber as number, accumulated as list) =>
                let
                    Url = baseUrl & "?page=" & Number.ToText(pageNumber) & "&per_page=100",
                    Response = try Json.Document(Web.Contents(Url)) otherwise null,
                    // Check whether we got data
                    HasData = Response <> null and List.Count(Response) > 0,
                    // Add the current page to the accumulated results
                    NewAccumulated = if HasData then
                        accumulated & Response
                    else
                        accumulated,
                    // A full page (100 rows) suggests more data remains
                    Result = if HasData and List.Count(Response) = 100 then
                        @GetPage(pageNumber + 1, NewAccumulated)
                    else
                        NewAccumulated
                in
                    Result,
            // Start the recursion
            AllData = GetPage(initialPage, {})
        in
            AllData,
    // Usage
    Source = GetAllPages("https://api.github.com/repos/microsoft/powerquery-m/commits", 1),
    // Convert to a table
    CommitsTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
in
    CommitsTable
Many modern APIs use cursor-based pagination for better performance:
let
    GetAllRecords = (baseUrl as text, optional cursor as text) =>
        let
            // Build the URL with the cursor
            Url = if cursor = null then
                baseUrl
            else
                baseUrl & "?cursor=" & cursor,
            Response = Json.Document(Web.Contents(Url)),
            // Extract data and pagination info
            Data = Response[data],
            NextCursor = try Response[pagination][next_cursor] otherwise null,
            HasMore = NextCursor <> null,
            // Recursive call if more data exists
            Result = if HasMore then
                Data & @GetAllRecords(baseUrl, NextCursor)
            else
                Data
        in
            Result,
    Source = GetAllRecords("https://api.example.com/records"),
    RecordsTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
in
    RecordsTable
Rate Limiting Consideration: Aggressive pagination can trigger API rate limits. Consider adding delays using Function.InvokeAfter() in production scenarios.
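Here is a sketch of that throttling pattern. FetchPage is a hypothetical single-page fetch; Function.InvokeAfter defers a function call by a duration, so each page request waits one second before executing:

```m
let
    // Hypothetical single-page fetch used to illustrate throttling
    FetchPage = (page as number) =>
        Json.Document(Web.Contents(
            "https://api.example.com/records?per_page=100&page=" & Number.ToText(page))),
    // Wrap the fetch so each call is delayed by one second
    ThrottledFetch = (page as number) =>
        Function.InvokeAfter(() => FetchPage(page), #duration(0, 0, 0, 1)),
    // Fetch five pages with a delay between requests, then flatten
    Pages = List.Transform({1..5}, ThrottledFetch)
in
    List.Combine(Pages)
```

Be aware that Power Query's lazy evaluation can reorder or parallelize work; for strict pacing guarantees, move the collection loop outside Power Query (for example, into an Azure Function).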
When APIs aren't available, web scraping becomes necessary. Power Query's HTML parsing capabilities are sophisticated but require understanding the document object model.
Start with a simple scenario—extracting financial data from a public website:
let
    // Load the web page
    Source = Web.Page(Web.Contents("https://finance.yahoo.com/quote/MSFT/history")),
    // Web.Page returns one row per table it detects on the page;
    // select the first one (inspect the output to confirm the index)
    HistoryTable = Source{0}[Data],
    // Clean up the headers
    PromotedHeaders = Table.PromoteHeaders(HistoryTable, [PromoteAllScalars = true]),
    // Type the columns appropriately (names must match the page exactly)
    TypedTable = Table.TransformColumnTypes(PromotedHeaders, {
        {"Date", type date},
        {"Open", type number},
        {"High", type number},
        {"Low", type number},
        {"Close*", type number},
        {"Adj Close**", type number},
        {"Volume", Int64.Type}
    })
in
    TypedTable
For more complex pages, you need granular control over element selection. While Power Query doesn't support CSS selectors directly, you can navigate the HTML structure programmatically:
let
    // Load page content
    Source = Web.Page(Web.Contents("https://example.com/data-page")),
    // Navigate to the parsed document. Note: the navigation path and the
    // column names used below (ClassName, TagName, TextContent) are
    // illustrative; inspect Web.Page's actual output and adjust accordingly
    Body = Source{[Name = "Body"]}[Data],
    // Function to find elements by class
    FindByClass = (elements as table, className as text) =>
        let
            Filtered = Table.SelectRows(elements,
                each try Text.Contains([ClassName], className) otherwise false)
        in
            Filtered,
    // Extract data containers
    DataContainers = FindByClass(Body, "data-container"),
    // Process each container
    ProcessContainer = (row as record) =>
        let
            // Extract the nested data
            NestedData = row[Data],
            // Find specific data points
            Values = Table.SelectRows(NestedData,
                each [TagName] = "span" and Text.Contains([ClassName], "data-value")),
            // Extract the text content
            ExtractedValues = Table.AddColumn(Values, "Value",
                each try [TextContent] otherwise "")
        in
            ExtractedValues,
    // Apply processing to all containers
    ProcessedData = Table.AddColumn(DataContainers, "ProcessedData",
        each ProcessContainer(_))
in
    ProcessedData
Power Query's Web.Page() function can't execute JavaScript, which limits its effectiveness on modern single-page applications. However, you can often find the underlying API calls that populate the dynamic content:
let
    // Strategy 1: Find the API endpoint used by the JavaScript
    // (use browser developer tools to identify XHR requests)
    ApiEndpoint = "https://example.com/api/data-endpoint",
    // Strategy 2: Look for JSON data embedded in script tags
    Source = Web.Page(Web.Contents("https://example.com/dynamic-page")),
    // Find script tags (navigation path and column names are illustrative;
    // inspect Web.Page's actual output and adjust)
    Scripts = Table.SelectRows(Source{[Name = "Body"]}[Data],
        each [TagName] = "script"),
    // Extract JSON from script content. M has no regex support, so delimiter
    // functions are used; adjust the markers for your page
    ExtractJson = (scriptContent as text) =>
        let
            Marker = "window.initialData = ",
            // Take everything after the marker, then trim at the terminating ";"
            AfterMarker = Text.AfterDelimiter(scriptContent, Marker),
            JsonText = Text.BeforeDelimiter(AfterMarker, ";"),
            ParsedJson = try Json.Document(JsonText) otherwise null
        in
            ParsedJson,
    // Process all scripts to find embedded data
    ProcessedScripts = Table.AddColumn(Scripts, "ParsedData",
        each ExtractJson([TextContent]))
in
    ProcessedScripts
Web data sources are inherently unreliable. Network issues, API changes, and server maintenance create constant failure scenarios. Building resilient pipelines requires sophisticated error handling.
Power Query's try...otherwise construct handles exceptions, but web APIs communicate errors through HTTP status codes, empty responses, and malformed data:
let
    // Robust API request function
    SafeApiRequest = (url as text, optional options as record) =>
        let
            // Default options: two-minute timeout; handle these status codes
            // manually instead of letting Web.Contents raise an error
            DefaultOptions = [
                Headers = [#"User-Agent" = "PowerQuery/1.0"],
                Timeout = #duration(0, 0, 2, 0),
                ManualStatusHandling = {400, 401, 403, 404, 429, 500, 502, 503}
            ],
            // Merge provided options (later fields win)
            FinalOptions = Record.Combine({DefaultOptions, options ?? []}),
            // Make the request
            Response = try Web.Contents(url, FinalOptions) otherwise null,
            // Check response validity
            ValidResponse = if Response = null then
                [Success = false, Error = "Network request failed", Data = null]
            else
                let
                    // Get the response metadata; default to 0 if the status is absent
                    Metadata = Value.Metadata(Response),
                    StatusCode = Metadata[Response.Status]? ?? 0,
                    // Evaluate the response
                    Result = if StatusCode >= 200 and StatusCode < 300 then
                        [Success = true, Error = null, Data = Response]
                    else if StatusCode = 429 then
                        [Success = false, Error = "Rate limited", Data = null, Retry = true]
                    else if StatusCode >= 500 then
                        [Success = false, Error = "Server error: " & Number.ToText(StatusCode),
                            Data = null, Retry = true]
                    else
                        [Success = false, Error = "Client error: " & Number.ToText(StatusCode),
                            Data = null]
                in
                    Result
        in
            ValidResponse,
    // Usage with retry logic
    RequestWithRetry = (url as text, maxRetries as number) =>
        let
            AttemptRequest = (attempt as number) =>
                let
                    Response = SafeApiRequest(url),
                    ShouldRetry = Response[Retry]? = true and attempt < maxRetries,
                    Result = if Response[Success] then
                        Response
                    else if ShouldRetry then
                        @AttemptRequest(attempt + 1)
                    else
                        Response
                in
                    Result
        in
            AttemptRequest(1),
    // Test the function
    Source = RequestWithRetry("https://api.github.com/repos/microsoft/powerquery-m", 3)
in
    Source
APIs evolve, and schema changes break downstream processes. Implement validation at the point of data ingestion:
let
    // Expected schema definition
    ExpectedSchema = [
        RequiredFields = {"id", "name", "created_at"},
        FieldTypes = [
            id = Int64.Type,
            name = Text.Type,
            created_at = DateTime.Type
        ]
    ],
    // Schema validation function
    ValidateSchema = (data as table, schema as record) =>
        let
            // Check that the required fields exist
            ActualColumns = Table.ColumnNames(data),
            MissingFields = List.Difference(schema[RequiredFields], ActualColumns),
            HasMissingFields = List.Count(MissingFields) > 0,
            // Validate field types (simplified)
            TypeValidation = if HasMissingFields then
                [Valid = false, Errors = {"Missing fields: " & Text.Combine(MissingFields, ", ")}]
            else
                [Valid = true, Errors = {}],
            Result = [
                IsValid = TypeValidation[Valid],
                Errors = TypeValidation[Errors],
                Data = if TypeValidation[Valid] then data else null
            ]
        in
            Result,
    // API request with validation
    Source = Json.Document(Web.Contents("https://api.example.com/data")),
    DataTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Expand and validate
    ExpandedData = Table.ExpandRecordColumn(DataTable, "Column1",
        {"id", "name", "created_at"}),
    ValidationResult = ValidateSchema(ExpandedData, ExpectedSchema),
    // Return the validated data or raise an error
    FinalResult = if ValidationResult[IsValid] then
        ValidationResult[Data]
    else
        error "Schema validation failed: " & Text.Combine(ValidationResult[Errors], "; ")
in
    FinalResult
Web data connections are expensive operations. Optimizing performance requires understanding Power Query's execution model and implementing intelligent caching.
Unlike database connectors, web sources rarely support query folding. Every transformation happens in Power Query's engine, making data volume management critical:
let
    // Inefficient: download everything, then filter locally
    Source = Json.Document(Web.Contents("https://api.example.com/large-dataset")),
    AllData = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    FilteredData = Table.SelectRows(AllData, each [status] = "active"),
    // Efficient: push the filter to the API instead
    FilteredUrl = "https://api.example.com/large-dataset?status=active",
    EfficientSource = Json.Document(Web.Contents(FilteredUrl)),
    EfficientData = Table.FromList(EfficientSource, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
in
    EfficientData
For frequently updated web sources, implement incremental refresh to minimize data transfer:
let
    // Get the last refresh timestamp from the previous execution
    LastRefresh = try DateTime.From(#"Last Refresh Parameter") otherwise #datetime(2024, 1, 1, 0, 0, 0),
    // Convert to the API's format (HH = 24-hour clock)
    SinceParam = DateTime.ToText(LastRefresh, "yyyy-MM-ddTHH:mm:ssZ"),
    // Request only new data
    IncrementalUrl = "https://api.example.com/data?since=" & SinceParam,
    Source = Json.Document(Web.Contents(IncrementalUrl)),
    NewData = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Add metadata for the next refresh
    CurrentTime = DateTime.UtcNow(),
    DataWithTimestamp = Table.AddColumn(NewData, "RefreshTime", each CurrentTime)
in
    DataWithTimestamp
When collecting data from multiple related endpoints, implement parallel processing to improve performance:
let
    // List of endpoints to process
    Endpoints = {
        "https://api.example.com/dataset1",
        "https://api.example.com/dataset2",
        "https://api.example.com/dataset3",
        "https://api.example.com/dataset4"
    },
    // Function to process a single endpoint
    ProcessEndpoint = (url as text) =>
        let
            Source = try Json.Document(Web.Contents(url)) otherwise {},
            AsTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            WithSource = Table.AddColumn(AsTable, "SourceUrl", each url)
        in
            WithSource,
    // Process all endpoints (Power Query parallelizes the requests)
    ProcessedEndpoints = List.Transform(Endpoints, ProcessEndpoint),
    // Combine the results
    CombinedData = Table.Combine(ProcessedEndpoints)
in
    CombinedData
Moving from experimentation to production requires addressing enterprise concerns: security, monitoring, governance, and maintainability.
Create flexible pipelines that work across development, staging, and production environments:
let
    // Environment-specific configuration
    Config = [
        Development = [
            BaseUrl = "https://api-dev.example.com",
            RateLimit = 10,
            Timeout = 30
        ],
        Production = [
            BaseUrl = "https://api.example.com",
            RateLimit = 100,
            Timeout = 60
        ]
    ],
    // Get the current environment (from a parameter)
    Environment = try #"Environment Parameter" otherwise "Development",
    CurrentConfig = Record.Field(Config, Environment),
    // Configurable API client
    ApiClient = [
        BaseUrl = CurrentConfig[BaseUrl],
        MakeRequest = (endpoint as text, optional params as record) =>
            let
                // Build the URL with parameters
                QueryString = if params = null then ""
                    else "?" & Uri.BuildQueryString(params),
                FullUrl = CurrentConfig[BaseUrl] & endpoint & QueryString,
                // Standard headers
                Headers = [
                    #"User-Agent" = "DataPipeline/1.0",
                    #"Accept" = "application/json"
                ],
                // Make the request with a timeout (in seconds)
                Response = Web.Contents(FullUrl, [
                    Headers = Headers,
                    Timeout = #duration(0, 0, 0, CurrentConfig[Timeout])
                ])
            in
                Json.Document(Response)
    ],
    // Usage (query values must be text for Uri.BuildQueryString)
    Source = ApiClient[MakeRequest]("/data", [limit = "100"])
in
    Source
Implement monitoring to track API health and data quality:
let
    // Monitoring function
    MonitorApiHealth = (url as text) =>
        let
            StartTime = DateTime.UtcNow(),
            // try without otherwise yields a [HasError, Value/Error] record,
            // so failures are captured rather than propagated
            Attempt = try Json.Document(Web.Contents(url)),
            ResponseTime = Duration.TotalSeconds(DateTime.UtcNow() - StartTime),
            // Log metrics (in a real implementation, send these to a monitoring system)
            LogEntry = [
                Url = url,
                Success = not Attempt[HasError],
                ResponseTime = ResponseTime,
                Timestamp = StartTime,
                RecordCount = if Attempt[HasError] then 0
                    else (try List.Count(Attempt[Value]) otherwise 0)
            ]
        in
            LogEntry,
    // Monitor multiple endpoints
    EndpointsToMonitor = {
        "https://api.example.com/health",
        "https://api.example.com/data"
    },
    HealthChecks = List.Transform(EndpointsToMonitor, MonitorApiHealth),
    HealthTable = Table.FromList(HealthChecks, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    ExpandedHealth = Table.ExpandRecordColumn(HealthTable, "Column1",
        {"Url", "Success", "ResponseTime", "Timestamp", "RecordCount"})
in
    ExpandedHealth
Maintain clear documentation about data sources and transformations:
let
    // Data lineage metadata
    LineageInfo = [
        SourceSystem = "External API",
        ApiVersion = "v1",
        Endpoint = "https://api.example.com/data",
        LastModified = DateTime.UtcNow(),
        Schema = [
            Fields = {"id", "name", "created_at", "status"},
            Types = {"number", "text", "datetime", "text"}
        ],
        Transformations = {
            "Filter active records",
            "Parse datetime fields",
            "Add data quality flags"
        }
    ],
    // Main data processing
    Source = Json.Document(Web.Contents("https://api.example.com/data")),
    DataTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Attach the lineage metadata to every row
    WithLineage = Table.AddColumn(DataTable, "DataLineage", each LineageInfo)
in
    WithLineage
Enterprise scenarios often require integrating multiple web sources or combining web data with other systems.
Combine data from multiple APIs with different schemas and update frequencies:
let
    // Source 1: Customer data from the CRM API
    CrmData =
        let
            Source = Json.Document(Web.Contents("https://crm-api.example.com/customers")),
            AsTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            Expanded = Table.ExpandRecordColumn(AsTable, "Column1", {"id", "name", "email"}),
            WithSource = Table.AddColumn(Expanded, "Source", each "CRM")
        in
            WithSource,
    // Source 2: Activity data from the marketing platform
    MarketingData =
        let
            Source = Json.Document(Web.Contents("https://marketing-api.example.com/activities")),
            AsTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            Expanded = Table.ExpandRecordColumn(AsTable, "Column1", {"customer_id", "activity", "timestamp"}),
            WithSource = Table.AddColumn(Expanded, "Source", each "Marketing")
        in
            WithSource,
    // Standardize the schemas
    StandardizedCrm = Table.RenameColumns(CrmData, {{"id", "customer_id"}}),
    // Join the datasets
    JoinedData = Table.NestedJoin(StandardizedCrm, {"customer_id"},
        MarketingData, {"customer_id"}, "MarketingActivities", JoinKind.LeftOuter),
    // Expand the nested activities
    ExpandedActivities = Table.ExpandTableColumn(JoinedData, "MarketingActivities",
        {"activity", "timestamp"})
in
    ExpandedActivities
While Power Query doesn't support true streaming, you can simulate real-time updates for dashboards:
let
    // Function to get the current data with a timestamp
    GetCurrentData = () =>
        let
            CurrentTime = DateTime.UtcNow(),
            // Fetch the latest data
            Source = Json.Document(Web.Contents("https://api.example.com/live-data")),
            // Add the fetch timestamp to each record
            WithTimestamp = List.Transform(Source,
                each Record.AddField(_, "FetchTime", CurrentTime)),
            // Convert to a table
            AsTable = Table.FromList(WithTimestamp, Splitter.SplitByNothing(), null, null, ExtraValues.Error)
        in
            AsTable,
    // Get the current snapshot
    CurrentData = GetCurrentData(),
    // In a real streaming scenario, you might:
    // 1. Store historical data in a separate table
    // 2. Implement change detection
    // 3. Use Power BI's automatic refresh features
    FinalResult = CurrentData
in
    FinalResult
Let's build a comprehensive data pipeline that demonstrates advanced web data integration techniques. We'll create a competitive intelligence dashboard that monitors multiple data sources.
Scenario: Your company needs to track competitor pricing, social media sentiment, and market news. You'll build a pipeline that pulls pricing data from a competitor API, industry headlines from an RSS feed, and social mentions from a web-based monitoring portal, then scores data quality and consolidates everything into a common schema.
let
    // Configuration for multiple sources
    Sources = [
        PricingApi = [
            Url = "https://api.competitor.com/pricing",
            Type = "JSON",
            RefreshFrequency = #duration(0, 1, 0, 0),  // Every hour
            Headers = [#"Authorization" = "Bearer " & #"API Key Parameter"]
        ],
        NewsRss = [
            Url = "https://industry-news.example.com/rss",
            Type = "XML",
            RefreshFrequency = #duration(0, 0, 15, 0),  // Every 15 minutes
            Headers = []
        ],
        SocialWeb = [
            Url = "https://social-monitoring.example.com/mentions?company=competitor",
            Type = "HTML",
            RefreshFrequency = #duration(0, 0, 30, 0),  // Every 30 minutes
            Headers = []
        ]
    ],
    // Generic source processor
    ProcessSource = (sourceName as text, config as record) =>
        let
            Response = try
                if config[Type] = "JSON" then
                    Json.Document(Web.Contents(config[Url], [Headers = config[Headers]]))
                else if config[Type] = "XML" then
                    Xml.Tables(Web.Contents(config[Url], [Headers = config[Headers]]))
                else if config[Type] = "HTML" then
                    Web.Page(Web.Contents(config[Url], [Headers = config[Headers]]))
                else
                    error "Unsupported source type"
                otherwise null,
            // Add metadata
            WithMetadata = [
                Source = sourceName,
                Data = Response,
                FetchTime = DateTime.UtcNow(),
                Success = Response <> null
            ]
        in
            WithMetadata,
    // Process all sources
    ProcessedSources = Record.TransformFields(Sources,
        List.Transform(Record.FieldNames(Sources),
            each {_, (config) => ProcessSource(_, config)})),
    // Keep only the successful results
    SuccessfulSources = Record.SelectFields(ProcessedSources,
        List.Select(Record.FieldNames(ProcessedSources),
            each Record.Field(ProcessedSources, _)[Success] = true))
in
    SuccessfulSources
let
    // Data quality assessment function
    AssessDataQuality = (data as any, sourceType as text) =>
        let
            Score = if sourceType = "JSON" then
                let
                    // JSON quality checks
                    RecordCount = try List.Count(data) otherwise 0,
                    HasValidStructure = try data{0}[id] <> null otherwise false,
                    QualityScore =
                        (if RecordCount > 0 then 30 else 0) +
                        (if HasValidStructure then 40 else 0) +
                        30  // Base score for a successful fetch
                in
                    QualityScore
            else if sourceType = "HTML" then
                let
                    // HTML quality check: Web.Page output is a table of detected tables
                    HasTables = try Table.RowCount(data) > 0 otherwise false,
                    QualityScore =
                        (if HasTables then 50 else 0) +
                        50  // Base score for a successful fetch
                in
                    QualityScore
            else
                50,  // Default score for other types
            QualityGrade = if Score >= 80 then "High"
                else if Score >= 60 then "Medium"
                else if Score >= 40 then "Low"
                else "Poor"
        in
            [Score = Score, Grade = QualityGrade],
    // Apply the quality assessment (using the previous exercise's result)
    PreviousResult = #"Previous Step",  // Reference to the previous step
    WithQuality = Record.TransformFields(PreviousResult,
        List.Transform(Record.FieldNames(PreviousResult),
            each {_, (sourceData) =>
                Record.AddField(sourceData, "Quality",
                    AssessDataQuality(sourceData[Data], "JSON"))}))
in
    WithQuality
let
    // Transform each source to a common schema
    TransformToCommonSchema = (sourceData as record) =>
        let
            SourceName = sourceData[Source],
            Transformed = if SourceName = "PricingApi" then
                let
                    Data = sourceData[Data],
                    AsTable = Table.FromList(Data, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
                    Expanded = Table.ExpandRecordColumn(AsTable, "Column1", {"product", "price", "date"}),
                    Standardized = Table.AddColumn(Expanded, "Category", each "Pricing")
                in
                    Standardized
            else if SourceName = "NewsRss" then
                let
                    // Process the RSS/XML data
                    Data = sourceData[Data],
                    Items = Data{0}[Table],
                    Standardized = Table.AddColumn(Items, "Category", each "News")
                in
                    Standardized
            else
                // Default transformation
                #table({"Category", "Value", "Timestamp"},
                    {{"Unknown", "No data", DateTime.UtcNow()}}),
            // Add source metadata to every row (note: Transformed is the step
            // in scope here; the inner Standardized steps are not)
            WithMetadata = Table.AddColumn(Transformed, "SourceMetadata",
                each [
                    Source = SourceName,
                    FetchTime = sourceData[FetchTime],
                    Quality = sourceData[Quality]
                ])
        in
            WithMetadata,
    // Apply the transformation (using the previous result)
    PreviousResult = #"Previous Step",
    TransformedTables = Record.TransformFields(PreviousResult,
        List.Transform(Record.FieldNames(PreviousResult),
            each {_, TransformToCommonSchema})),
    // Combine all tables
    AllTables = Record.FieldValues(TransformedTables),
    CombinedData = Table.Combine(AllTables)
in
    CombinedData
Problem: Queries work initially but fail during scheduled refreshes when OAuth tokens expire.
Solution: Implement token refresh logic and use Power Query parameters for credential management:
let
    // Check token validity before each request
    CheckTokenValidity = (token as text) =>
        let
            TestUrl = "https://api.example.com/auth/verify",
            TestResponse = try Web.Contents(TestUrl, [
                Headers = [#"Authorization" = "Bearer " & token]
            ]) otherwise null,
            IsValid = TestResponse <> null
        in
            IsValid,
    CurrentToken = #"OAuth Token Parameter",
    RefreshToken = #"Refresh Token Parameter",
    ValidToken = if CheckTokenValidity(CurrentToken) then
        CurrentToken
    else
        // RefreshAccessToken is your token-refresh function (see the OAuth
        // section earlier in this lesson for an implementation sketch)
        RefreshAccessToken(RefreshToken),
    Source = Json.Document(Web.Contents("https://api.example.com/data", [
        Headers = [#"Authorization" = "Bearer " & ValidToken]
    ]))
in
    Source
Problem: APIs return 429 errors during data refresh, especially with pagination.
Solution: Implement exponential backoff and rate limit detection:
let
    // Rate-limited request function
    RateLimitedRequest = (url as text, attempt as number) =>
        let
            MaxAttempts = 5,
            // Tell Web.Contents not to raise on 429 so the status can be inspected
            Response = try Web.Contents(url, [ManualStatusHandling = {429}]),
            StatusCode = if Response[HasError] then null
                else Value.Metadata(Response[Value])[Response.Status]?,
            Result = if not Response[HasError] and StatusCode <> 429 then
                Response[Value]
            else if StatusCode = 429 and attempt < MaxAttempts then
                let
                    // Exponential backoff: wait 2^attempt seconds before retrying
                    DelaySeconds = Number.Power(2, attempt),
                    RetryResult = Function.InvokeAfter(
                        () => @RateLimitedRequest(url, attempt + 1),
                        #duration(0, 0, 0, DelaySeconds))
                in
                    RetryResult
            else
                error "Request failed after " & Text.From(MaxAttempts) & " attempts"
        in
            Result
in
    RateLimitedRequest("https://api.example.com/data", 1)
Problem: API responses change schema, breaking column expansions and transformations.
Solution: Implement defensive programming with schema flexibility:
let
    // Flexible column expansion
    SafeExpandColumns = (source as table, columnName as text, expectedColumns as list) =>
        let
            // Get a sample record to inspect the available columns
            SampleRecord = try source{0} otherwise error "Empty table",
            SampleColumn = Record.Field(SampleRecord, columnName),
            // Only records can be expanded; list their actual fields
            AvailableColumns = if SampleColumn is record then
                Record.FieldNames(SampleColumn)
            else
                {},
            // Only expand columns that actually exist
            ColumnsToExpand = List.Intersect({expectedColumns, AvailableColumns}),
            // Perform the expansion if any expected columns are available
            Result = if List.Count(ColumnsToExpand) > 0 then
                Table.ExpandRecordColumn(source, columnName, ColumnsToExpand)
            else
                // Add a marker column when no expected columns were found
                Table.AddColumn(source, columnName & "_NoData", each null)
        in
            Result,
    Source = Json.Document(Web.Contents("https://api.example.com/data")),
    DataTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Safely expand with the expected columns
    ExpectedColumns = {"id", "name", "created_at", "status"},
    ExpandedTable = SafeExpandColumns(DataTable, "Column1", ExpectedColumns)
in
    ExpandedTable
Problem: HTML structure changes break web scraping queries.
Solution: Use multiple fallback strategies and robust element identification:
let
    // Multi-strategy element finder. Note: the ClassName, TagName, and
    // TextContent columns are illustrative; inspect Web.Page's actual
    // output and adjust the column names accordingly
    FindElementsByMultipleStrategies = (htmlTable as table) =>
        let
            // Strategy 1: Find by exact class name
            Strategy1 = Table.SelectRows(htmlTable,
                each try [ClassName] = "data-row" otherwise false),
            // Strategy 2: Find by partial class match
            Strategy2 = if Table.RowCount(Strategy1) = 0 then
                Table.SelectRows(htmlTable,
                    each try Text.Contains([ClassName], "data") otherwise false)
            else
                Strategy1,
            // Strategy 3: Find by tag name and content pattern
            Strategy3 = if Table.RowCount(Strategy2) = 0 then
                Table.SelectRows(htmlTable,
                    each try [TagName] = "tr" and Text.Contains([TextContent], "$") otherwise false)
            else
                Strategy2,
            // Final fallback: take all table rows
            FinalResult = if Table.RowCount(Strategy3) = 0 then
                Table.SelectRows(htmlTable, each [TagName] = "tr")
            else
                Strategy3
        in
            FinalResult,
    Source = Web.Page(Web.Contents("https://example.com/data-page")),
    // Select the first detected table (verify the index against Web.Page's output)
    DataTable = Source{0}[Data],
    ExtractedData = FindElementsByMultipleStrategies(DataTable)
in
    ExtractedData
You've now mastered the complete spectrum of web data integration with Power Query, from basic API connections to enterprise-grade pipelines with comprehensive error handling and monitoring. The key concepts that distinguish expert-level implementations are:
Architectural thinking: Understanding when to use Web.Contents() versus Web.Page(), how Power Query's execution model affects web requests, and designing for scalability from the start.
Resilience patterns: Building queries that handle the inherent instability of web sources through sophisticated error handling, retry logic, and graceful degradation.
Production readiness: Implementing monitoring, parameterization, and data quality controls that enable reliable operation in enterprise environments.
Integration complexity: Combining multiple web sources with different schemas, authentication methods, and update frequencies into unified datasets.
The techniques you've learned here form the foundation for advanced data integration scenarios. Consider these natural progression paths:
Custom Connector Development: When you're regularly working with specific APIs, building custom Power Query connectors provides better performance, built-in authentication handling, and query folding capabilities.
Real-time Integration Patterns: Explore how Power BI's streaming datasets and Azure Stream Analytics can complement Power Query for true real-time scenarios.
Data Governance and Lineage: Implement comprehensive metadata tracking and automated documentation systems for web-sourced data in enterprise environments.
Advanced Authentication Systems: Deep-dive into enterprise identity providers, certificate-based authentication, and custom authentication schemes for secure web data access.
The competitive advantage in data-driven organizations increasingly comes from automated intelligence gathering. You now have the tools to build those systems that transform scattered web data into strategic insights, giving your organization the information edge it needs to succeed.