[Spark] Working with Iceberg Tables in Spark
Creating and updating tables, merge queries / partitionOverwriteMode, storeAssignmentPolicy
Posted by
Wonyong Jang
on October 09, 2024 ·
7 mins read
[Spark] PySpark Development Environment Setup and Key Features
Comparing Spark with Scala and Python / Temp View / Python Package Management / spark-submit options
Posted by
Wonyong Jang
on August 08, 2024 ·
11 mins read
[Spark] Dynamic Partition Pruning / Speculative Execution
filter push down / optimizing query performance when joining dimension and fact tables
Posted by
Wonyong Jang
on May 15, 2024 ·
5 mins read
[Spark] Join Strategies and Shuffle
shuffle join, broadcast join / shuffle sort merge join, broadcast hash join
Posted by
Wonyong Jang
on April 20, 2024 ·
7 mins read
[Spark] Adaptive Query Execution
Broadcast Hash Join / coalescing shuffle partitions, switching join strategies, optimizing skew joins
Posted by
Wonyong Jang
on April 15, 2024 ·
8 mins read
[Spark] On Kubernetes
Comparison with Spark on an EMR Cluster
Posted by
Wonyong Jang
on March 03, 2024 ·
2 mins read
[Spark] Memory Management and Tuning
Choosing an appropriate number of Drivers and Executors when running Spark
Posted by
Wonyong Jang
on February 13, 2024 ·
4 mins read
[Spark] Log Rolling with Log4j (RollingFileAppender)
Using a custom Log4j / Log rolling in long-running Spark Streaming
Posted by
Wonyong Jang
on November 19, 2023 ·
3 mins read
[Spark] Writing Test Code
Unit testing with the scalatest and spark-testing-base libraries (rdd, dataFrame, dataSet)
Posted by
Wonyong Jang
on September 29, 2023 ·
7 mins read
[Spark] Spark streaming processing delay (Incident Review)
Monitor Spark streaming applications on Amazon EMR / StreamingListener
Posted by
Wonyong Jang
on July 09, 2023 ·
14 mins read
[Spark ML] Implementing the LightGBM Algorithm with Spark
Key hyperparameters / Early Stopping
Posted by
Wonyong Jang
on April 08, 2023 ·
4 mins read
[Spark ML] Decision Trees in Spark ML
DecisionTreeClassifier, RandomForestClassifier, GBTClassifier / MulticlassClassificationEvaluator, BinaryClassificationEvaluator
Posted by
Wonyong Jang
on April 05, 2023 ·
9 mins read
[Spark ML] Data Preprocessing in Spark ML
Label Encoding(StringIndexer, IndexToString) / OneHotEncoderEstimator / Scaling(StandardScaler, MinMaxScaler)
Posted by
Wonyong Jang
on April 02, 2023 ·
19 mins read
[Spark ML] Building a Prediction Model on the Iris Dataset with Spark ML
randomSplit, vectorAssembler, pipeline / cross-validation and hyperparameter tuning with crossValidator and trainValidationSplit
Posted by
Wonyong Jang
on April 01, 2023 ·
13 mins read
[Spark] Installation and Practice Environment Setup
Script for launching the Scala Spark prompt / running Spark with Docker / Databricks platform Community Edition
Posted by
Wonyong Jang
on January 27, 2023 ·
8 mins read
[Spark] Migrating to Structured Streaming
Migrating from Spark Streaming to Structured Streaming / integrating Structured Streaming with Kinesis / checkpoint and initialPosition / Troubleshooting
Posted by
Wonyong Jang
on March 07, 2022 ·
15 mins read
[Spark] Implementing Word Count with Structured Streaming
Watermarking in append and update modes / handling late data
Posted by
Wonyong Jang
on January 07, 2022 ·
7 mins read
[Spark] Structured Streaming Fault Tolerance
Planner, Source, State, Sink
Posted by
Wonyong Jang
on January 05, 2022 ·
3 mins read
[Spark] What Is Structured Streaming
Comparison with Spark Streaming and its problems in use / Unbounded Table / OutputMode (Complete, Update, Append)
Posted by
Wonyong Jang
on January 03, 2022 ·
7 mins read
[Spark] Building a Spark Practice Environment with Docker Ubuntu Containers
Setting up a master/worker cluster environment with Docker / spark-submit / standalone cluster manager
Posted by
Wonyong Jang
on August 29, 2021 ·
14 mins read
[Spark] Shared Variables: Broadcast and Accumulator
broadcast, accumulator, closure
Posted by
Wonyong Jang
on July 08, 2021 ·
4 mins read
[Spark] How to override a spark dependency in cluster mode(AWS EMR)
Relocating packages with shadowJar when library versions conflict
Posted by
Wonyong Jang
on July 08, 2021 ·
5 mins read
[Spark] Dynamic Resource Allocation in AWS EMR Cluster
Spark (Streaming) Dynamic Allocation / External Shuffle Service / troubleshooting EC2-based AWS EMR auto scaling
Posted by
Wonyong Jang
on June 25, 2021 ·
12 mins read
[Spark] Persistence and Data Locality
RDD Persistence / memory, disk cache / locality level (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL)
Posted by
Wonyong Jang
on June 23, 2021 ·
11 mins read
[Spark] Apache Spark Partitioning
RDD on a Cluster / choosing the number and size of partitions / coalesce and repartition / spark.sql.files.maxPartitionBytes
Posted by
Wonyong Jang
on June 21, 2021 ·
9 mins read
[Spark] Apache Spark Serialization
Serialization challenges with Spark and Scala / Passing functions to Spark
Posted by
Wonyong Jang
on June 15, 2021 ·
16 mins read
[Spark] Pipeline and Stage
When a stage is skipped / stage separation caused by shuffles / write and read when a shuffle occurs
Posted by
Wonyong Jang
on May 10, 2021 ·
6 mins read
[Spark] Apache Spark DataSet
How to use the main DataSet operations / Encoder
Posted by
Wonyong Jang
on May 07, 2021 ·
11 mins read
[Spark] Apache Spark SQL's Tungsten Project
Hardware-level optimization (CPU, memory, etc.) when using Spark SQL
Posted by
Wonyong Jang
on May 04, 2021 ·
10 mins read
[Spark] Apache Spark SQL's Catalyst Optimizer
Engine-level performance optimization when using Spark SQL / Optimized Query Plan
Posted by
Wonyong Jang
on May 03, 2021 ·
7 mins read
[Spark] Implementing Apache Spark DataFrames
Main DataFrame operations / groupBy / UDF (User Defined Function) / join
Posted by
Wonyong Jang
on May 02, 2021 ·
16 mins read
[Spark] Apache Spark SQL and DataFrame
RDD vs DataFrame / Catalyst Optimizer / Tungsten execution engine / Encoder
Posted by
Wonyong Jang
on May 01, 2021 ·
5 mins read
[Spark] Streaming Graceful Shutdown
How to gracefully shut down a Spark Streaming job / differences between SIGKILL, SIGTERM, and SIGINT
Posted by
Wonyong Jang
on April 19, 2021 ·
9 mins read
[Spark] (Structured) Streaming Checkpointing
Checkpointing in Spark Streaming and Structured Streaming, implemented using S3 as the checkpoint location (AWS credentials) / S3A and EMRFS
Posted by
Wonyong Jang
on April 17, 2021 ·
10 mins read
[Spark] Streaming Data Sources
Kafka, Kinesis / Direct, Receiver Based Data Sources, Fault Tolerance / Backpressure
Posted by
Wonyong Jang
on April 15, 2021 ·
7 mins read
[Spark] Fault Tolerance and Graph in Streaming
Failure recovery / the DStream Graph / Receiver, Network Input Tracker, Job Scheduler, Job Manager, Block Manager
Posted by
Wonyong Jang
on April 14, 2021 ·
8 mins read
[Spark] DStream and Key Operations in Streaming
DStream(Discretized Streams) / stateful(window, state) / transform / Receiver Input Stream / Block Manager
Posted by
Wonyong Jang
on April 12, 2021 ·
18 mins read
[Spark] Various RDD Operations in Apache Spark
Lazily evaluated Transformations, immediately executed Actions / narrow and wide transformations
Posted by
Wonyong Jang
on April 12, 2021 ·
22 mins read
[Spark] Getting Started with Apache Spark
Driver, Executor, Node, Job, Stage, Task, Cluster Manager / RDD, Fault tolerance / Hadoop
Posted by
Wonyong Jang
on April 11, 2021 ·
16 mins read