상세 컨텐츠

본문 제목

[SQL 스터디_1팀] 5주차_강의노트 (1)

심화 스터디/SQL 스터디

by bravesk 2023. 5. 9. 10:48

본문

https://stormy-soup-0ee.notion.site/91cdc612c424414495835bd67f5f59fd- 상세 설명 (이미지 포함) 위의 노션 참고 

 

백서경

SQL Fundamentals

stormy-soup-0ee.notion.site

SQL Fundamentals

Data Manipulation in SQL

3. Correlated Queries, Nested Queries, and Common Table Expressions

  • 3.1. Correlated subqueries
    • 정의 : outer query의 값을 이용해서 결과를 내는 special subquery( 내부 subquery에서 외부 table의 값을 참조하고 싶을 때)
    • 실행 : final dataset의 모든 row마다 다시 실행됨
    • 사용 : advanced joining, filtering, evaluating data 같은 특별한 연산에 사용
    • 예시 : 평균보다 많은 골이 득점된 경기 찾기 
    • SELECT s.stage, ROUND(s.avg_goals,2) AS avg_goal, (SELECT AVG(home_goal + away_goal) FROM match WHERE season='2012/2013') AS overall_avg FROM (SELECT stage, AVG(home_goal+away_goal) AS avg_goals FROM match WHERE season='2012/2013' GROUP BY stage) AS s WHERE s.avg_goals>(SELECT AVG(home_goal+away_goal) FROM match AS m WHERE s.stage>m.stage);
    • SELECT s.stage, ROUND(s.avg_goals,2_ AS avg_goal, (SELECT AVG(home_goal + away_goal) FROM match WHERE season='2012/2013') AS overall_avg FROM (SELECT stage, AVG(home_goal+away_goal) AS avg_goals FROM match WHERE season='2012/2013' GROUP BY stage) AS s WHERE s.avg_goals > (SELECT AVG(home_goal+away_goal) FROM match WHERE season='2012/2013');
    • simple vs correlated subqueries
    simple
    • main query와 독립적으로 실행될 수 있음
    • 전체 query에서 한번만 실행됨
    • 예시 : 국가별 평균 골
    SELECT
    	c.name AS country
    	AVG(m.home_goal+m.away_goal)
    		AS avg_goals
    FROM country AS c
    LEFT JOIN match AS m
    ON c.id=m.country_id
    GROUP BY country;
    
    correlated
    • main query가 실행되어야 실행 가능
    • 생성되는 모든 row마다 실행됨 → 실행시간 증가
    SELECT
    	c.name AS country,
    	(SELECT
    		AVG(home_goal+away_goal)
    	FROM match AS m
    	WHERE m.country_id=c.id)
    		AS avg_goals
    FROM country AS c
    GROUP BY country;
    
  • 3.2. Nested subqueries
    • 정의 : 다른 subquery에 nested 된 subquery
    • 사용: 여러번의 변환이 필요할 때
    • 예시 : 월별 총 골 수와 월별 평균 골 수의 차이Extract : 날짜에서 월을 가져옴
    • SELECT EXTRACT(MONTH FROM date) AS month, SUM(m.home_goal+m.away_goal) AS total_goals, SUM(m.home_goal+m.away_goal)- (SELECT AVG(goals) FROM (SELECT EXTRACT(MONTH from date) AS month, SUM(home_goal+away_goal) AS goals FROM match GROUP BY month) AS s) AS diff FROM match AS m GROUP by month;
    • correlated nested subqueries
      SELECT 
      	c.name AS country,
      	(SELECT AVG(home_goal+away_goal)
      	FROM match AS m
      	WHERE m.coutry_id=c.id
      		AND id IN(
      			SELECT id
      			FROM match
      			WHERE season='2011/2012')) AS avg_goals
      FROM country AS c
      GROUP BY country;
      
    • 예시 : 2011/2012 시즌에 각 국가의 평균 골 수
  • 3.3. Common Table Expressions(CTE)
    • 필요성 : 길어지고 읽기 어려운 query
    • 정의 : main query 전에 선언된 table
    • 사용 : WITH
    • WITH cte AS( SELECT col1,col2 FROM table)
    • 예시
    • WITH s AS( SELECT country_id,id FROM match WHERE (home_goal+away_goal)>=10 ) SELECT c.name AS country, COUNT(s.id) AS matches, FROM country AS c INNER JOIN s ON c.id=s.country_id GROUP BY country;
    • multiple CTEs
    • WITH s1 AS( SELECT country_id,id FROM match WHERE (home_goal+away_goal)>=10), s2 AS( SELECT country_id,id FROM match WHERE (home_goal+away_goal)<=1 )
    • why?
      • 한번만 실행됨 - 메모리에 저장되어 query 효율이 좋아짐
      • query를 정리하는 데에 도움
      • 다른 CTE를 reference 할 수 있음
      • 스스로를 reference 할 수 있음(SELF JOIN)
  • 3.4. Deciding on techniques to use
    • 같은 task를 할 수 있는 여러 가지 방법을 배웠음
    • differentiating Techniques
      • 2+ table 합칠 때
      • 간단한 연산
      correlated subqueries
      • subquery와 table을 match
      • 한 table 당 두 개의 column을 join 할 수 없는 join의 한계 극복
      • 시간 오래 걸림
    • Join
    Multiple/Nested Subqueries
    • 여러 단계의 변환이 필요할 때
    Common Table Expressions
    • 다양한 piece of information이 필요할 때
    • subquery를 순차적으로 정리
    • 다른 CTE를 참조 가능

4. Window Functions

  • 4.1. It’s OVER
    • 필요성
      • 열 기준 연산들 → 행을 기준으로는?
      • GROUP BY : aggregate value를 non-aggregate data와 비교할 수 없음
    • Window functions
      • 이미 generate된 result set(window)을 가지고 연산, 비교할 수 있도록
      • data를 group by 하지 않아도 aggregate calculation을 할 수 있음 - SELECT 안의 subquery처럼
        • running total, ranking, moving average 등
    • 비교 : 2011/2012 시즌의 득점 수, 득점 평균
      SELECT
      	date,
      	(home_goal+away_goal) AS goals,
      	(SELECT AVG(home_goal+away_goal)
      		FROM match
      		WHERE season='2011/2012') AS 
      			overall_avg
      FROM match
      WHERE season='2011/2012';
      
      SELECT
      	date,
      	(home_goal+away_goal) AS goals,
      	AVG(home_goal+away_goal) OVER() AS overall_avg
      FROM match
      WHERE season='2011/2012';
      
      generated result set을 바로 사용
    • Generate a RANK
      SELECT
      	date,
      	(home_goal + away_goal) AS goals,
      	RANK() OVER(ORDER BY home_goal+away_goal DESC) AS goals_rank
      FROM match
      WHERE season='2011/2012'
      
      RANK() OVER(ORDER BY {순위를 매기고 싶은 기준)}동점이 있으면 skip 1→3
    • default : ASC
    • 예시 : 득점 수를 순위 매기기
    • key differences
      • ORDER BY를 제외하고 모든 query 이후에 연산되기 때문에, 기존 database가 아닌 result set의 정보를 사용함
      • SQLite에는 없음
  • 4.2. OVER with a PARTITION
    • PARTITION BY
      • 서로 다른 카테고리에서 value를 연산할 수 있음
      • 같은 column에서 다른 aggregate value를 연산할 수 있게 함
      • 사용 : AVG(home_goal) OVER(PARTITION BY {카테고리})
      • SELECT date, (home_goal+away_goal) AS goals, AVG(home_goal+away_goal) OVER(PARTITION BY season) AS season_avg FROM match;
      • 각 match의 득점 수, season의 평균 득점 수
    • PARTITION BY multiple columns
    • OVER(PARTITION BY m.season,c.name) AS season_ctry_avg
    • PARTITION BY considerations
      • 한 개 이상의 column에 대한 data를 partition 할 수 있음
      • aggregate calculation, rank 등에 대해서도 partition 할 수 있음
  • 4.3. Sliding windows
    • 필요성
    • aggregate function , rank 외에 dataset의 subsequent row에 따라 변하는 값을 calculate
    • Sliding windows
      • 현재 row에 대해 상대적으로 연산을 수행하는 window function
      • 1개 이상의 column에 대해 partition 될 수 있음
    • Sliding window keywords
      • <start>, <finish>에 들어가는 키워드FOLLOWING : row afterUNBOUNDED FOLLOWING : every row til end
      • CURRENT ROW : stop at current row
      • UNBOUNDED PRECEDING : every row since beginning
      • PRECEDING : row before
      • 예시
        SELECT
        	date,
        	home_goal,
        	away_goal,
        	SUM(home_goal)
        		OVER(ORDER BY date ROWS BETWEEN
        			UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
        FROM match
        WHERE hometown_id=8456 AND season='2011/2012';
        
    • ROWS BETWEEN <start> AND <finish>
    • Sliding window frame
      OVER(ORDER BY date ROWS BETWEEN
      			1 PRECEDING AND CURRENT ROW) AS running_total
      
  • 4.4. Bringing it all together
    • case study : Who defeated 맨유 in 2013/2014 season?
      • get team names with CTEs
      • get match outcome with CASE
      • determine how badly they los with window function
      -- Set up the home team CTE
      WITH home AS (
        SELECT m.id, t.team_long_name,
      	  CASE WHEN m.home_goal > m.away_goal THEN 'MU Win'
      		   WHEN m.home_goal < m.away_goal THEN 'MU Loss' 
        		   ELSE 'Tie' END AS outcome
        FROM match AS m
        LEFT JOIN team AS t ON m.hometeam_id = t.team_api_id),
      -- Set up the away team CTE
      away AS (
        SELECT m.id, t.team_long_name,
      	  CASE WHEN m.home_goal > m.away_goal THEN 'MU Loss'
      		   WHEN m.home_goal < m.away_goal THEN 'MU Win' 
        		   ELSE 'Tie' END AS outcome
        FROM match AS m
        LEFT JOIN team AS t ON m.awayteam_id = t.team_api_id)
      -- Select columns and and rank the matches by goal difference
      SELECT DISTINCT
          m.date,
          home.team_long_name AS home_team,
          away.team_long_name AS away_team,
          m.home_goal, m.away_goal,
          RANK() OVER(ORDER BY ABS(home_goal - away_goal) DESC) as match_rank
      -- Join the CTEs onto the match table
      FROM match AS m
      LEFT JOIN home ON m.id = home.id
      LEFT JOIN away ON m.id = away.id
      WHERE m.season = '2014/2015'
            AND ((home.team_long_name = 'Manchester United' AND home.outcome = 'MU Loss')
            OR (away.team_long_name = 'Manchester United' AND away.outcome = 'MU Loss'));
      

PostgreSQL Summary Stats and Window Functions

1. Introduction to window functions

  • 1.1. Introduction
    • Window functions
      • 현재 row와 관련 있는 연산을 수행
      • GROUP BY 로 aggregate 연산을 수행하는 것과 비슷하지만, 모든 row가 결과에 포함됨
      • 사용
        • 바로 이전, 이후 row의 값을 fetch
        • row의 rank를 assign
        • running total, running average
    • Anatomy of a window function
      • OVER 가 window function임을 나타냄
      • subquery 예 : ORDER BY, PARTITION BY, ROWS/RANGE PRECEDING/FOLLOWING/UNBOUNDED
    • FUCTION_NAME() OVER (empty or subquery)
    • Row numbers
      • 대표적인 window function
      ROW_NUMBER결과
    • SELECT YEAR, Event, Country, ROW_NUMBER() OVER () AS Row_N FROM Summer_Medals WHERE Medal='Gold';
  • 1.2. ORDER BY
    • ORDER BY as subclauses within OVER clause
    • 예시 : ORDER BY Year DESC : 가장 최근 연도가 1이 됨
    • SELECT Year, Event, Country, ROW_NUMBER() OVER (ORDER BY Year DESC) AS Row_N FROM Summer_Medals WHERE Medal='Gold';
    • Ordering by multiple columns
    • ROW_NUMBER() OVER (ORDER BY Year DESC, EVENT ASC) AS Row_N
    • Ordering in-and-outside OVER
      • in : assign Row_N
      • out : 결과 dataset의 출력 순서
      SELECT
      	Year, Event, Country,
      	ROW_NUMBER() OVER (ORDER BY Year DESC, Event ASC) AS Row_N
      FROM Summer_Medals
      WHERE
      	Medal='Gold';
      ORDER BY Country ASC, Row_N ASC;
      
      결과
      • 실행 순서 : in(OVER)→out
    • Application : Reigning champion( 현재, 이전 년도를 모두 이긴 champion)
      • 전년도와 이번 년도의 champion이 같은 row의 두 column이 되어 비교할 수 있어야 함
      • keyword : LAG(column,n)
        • 현재 row에서 같은 columns의 n개 이전 row값을 return
        • n=1이면 바로 이전 row의 값을 반환
        WITH Discus_Gold AS (
        	SELECT 
        		Year, Country AS Champion
        	FROM Summer_Medals
        	WHERE
        		Year IN (1996, 2000, 2004, 2008, 2012)
        		AND Gender='Men' AND Medal='Gold'
        		AND Event='Discus Throw')
        
        SELECT
        	YEAR, Champion,
        	LAG(Champion, 1) OVER
        		(ORDER BY Year ASC) AS Last_Champion
        FROM Discus_Gold
        ORDER BY Year ASC;
        
        결과
  • 1.3. PARTITION BY
    • column의 값에 따라 split table into partition
    • GROUP BY와 역할이 유사하지만 하나의 column으로 합쳐지는 것이 아님
    • window function을 이용해서 따로따로 연산됨
      • ROW_NUMBER : 각 partition 마다 reset
      • LAG : 같은 partition의 previous value만 fetch
    • 예시 : 종목별로 reigning champion 찾기결과
    • WITH Discus_Gold AS (...) SELECT Year, Event, Champion, LAG(Champion) OVER (PARTITION BY Event ORDER BY Event ASC, YEAR ASC) AS Last_Champion FROM Discus_Gold ORDER BY Event ASC, YEAR ASC;
    • Partitioning by multiple columns
    • ROW_NUMBER() OVER (PARTITION BY Year, Country)

2. Fetching, ranking, and paging

  • 2.1. Fetching
    • Fetching : table의 값을 하나의 column으로 fetch해 오는 것
    • The four functions
      • Relative : current row를 기준으로
        • LAG(column, n) : 현재 row를 기준으로 같은 column의 n row 전의 값을 return
        • LEAD(column,n) : 현재 row를 기준으로 같은 column의 n row 이후 값을 return
      • Absolute: current row와 무관
        • FIRST_VALUE(column) : table 또는 partition의 첫번째 값을 return
        • LAST_VALUE(column) : table 또는 partition의 마지막 값을 return
    • LEAD
      WITH Hosts AS(
      	SELECT DISTINCT Year, City
      	FROM Summer_Medals)
      SELECT
      	Year, City,
      	LEAD(City,1) OVER (ORDER BY Year ASC)
      		AS Next_City,
      	LEAD(City,2) OVER (ORDER BY YEAR ASC)
      		AS After_Next_City
      FROM Hosts
      ORDER BY Year ASC;
      
    • 예시 : 해당 년도 올림픽 개최국, 다음 올림픽 개최국, 그 다음 올림픽 개최국 fetch
    • FIRST_VALUE and LAST_VALUE
      • LAST_VALUE는 range를 정해줘야 한다
      SELECT
      	Year, City,
      	FIRST_VALUE(City) OVER
      		(ORDER BY YEAR ASC) AS First_City,
      	LAST_VALUE(City) OVER (
      		ORDER BY YEAR ASC
      		RANGE BETWEEN
      			UNBOUNDED PRECEDING AND
      			UNBOUNDED FOLLOWING
      		) AS Last_City
      FROM Hosts
      ORDER BY Year ASC;
      
    • 예시 : 첫번째 올림픽 개최국, 마지막 올림픽 개최국
    • PARTITION BY 와 함께 사용할 수 있다
  • 2.2. Ranking
    • The ranking functionsROW_NUMBER
      • unique하게 assign 됨. 값이 같아도 다른 값을 가지고 rank를 매김
      • 마지막 rank가 전체 row의 수와 같다
      RANK
      • 같은 값이면 같은 값을 assign 하고 skip 1→3
      • 마지막 rank가 전체 row의 수와 같다
      DENSE_RANK
      • 같은 값이면 같은 값을 assign하지만, skip 하지 않음. 1→2
      • 마지막 rank의 값은 unique value의 수
    • Ranking with partitioning
      WITH Country_Medals AS (...)
      
      SELECT 
      	Country, Athlete,
      	DENSE_RANK()
      		OVER (PARTITION BY Country
      						ORDER BY Medals DESC) AS Rank_N
      	FROM Country_Medals
      ORDER BY Country ASC, Medals DESC;
      
    • 예시 : 국가별로 rank
  • 2.3. Paging
    • Paging : data를 비슷한 크기의 chunk로 split 하는 것
      • 보내지는 data의 양을 줄이기 위해 page 단위로 return
      • Performance를 확인하기 위해 quartile이나 third로 나눔
    • NTILE(n)
      • n page로 data를 split
      • 예시 : 총 67 row의 data를 15개의 chunk로 나눔결과
      • WITH Disciplines AS ( SELECT DISTINCT Discipline FROM Summer_Medals) SELECT Discipline, NTILE(15) OVER () AS Page FROM Disciplines ORDER BY PAGE Asc;
      • Top, middle, and bottom thirds
        • n=3으로 설정, ORDER BY
        • 예시 : 메달 수를 기준으로 3개의 chunk로 나눔
        • WITH Country_Medals AS( SELECT Country, COUNT(*) AS Medals FROM Summer_Medals GROUP BY Country), SELECT Country, Medals NTILE(3) OVER (ORDER BY Medals DESC) AS THIRD FROM Country_Medals;
      • 활용 : Thirds averages
        • 예시 : 메달 수를 기준으로 3개의 chunk로 국가를 나눴을 때, 각 chunk 내부에서 메달 개수의 평균 구하기
        • WITH Country_Medals AS (...), Thirds AS( SELECT Country, Medals, NTILE(3) OVER (ORDER BY Medals DESC) AS Third FROM Country_Medals) SELECT Third, ROUND(AVG(Medals),2) AS Avg_Medals FROM Thirds GROUP BY Third ORDER BY Third ASC;

관련글 더보기

댓글 영역