My First Billion (of Rows) in DuckDB
First Impressions of DuckDB handling 450Gb in a real project
Published in TDS Archive, May 1, 2024

Anatomy of Windows Functions
Theory and practice of an underappreciated SQL operation
Published in TDS Archive, Jun 11, 2024

Automatically Detecting Label Errors in Datasets with CleanLab
A Tale of AI and wrongly-classified Brazilian Federal Laws
Published in TDS Archive, Jul 22, 2023

Automatically Managing Data Pipeline Infrastructures With Terraform
I know the manual work you did last summer
Published in TDS Archive, May 2, 2023

Data Pipeline with Airflow and AWS Tools (S3, Lambda & Glue)
Learning a little about these tools and how to integrate them
Published in TDS Archive, Apr 6, 2023

Creating a Data Pipeline with Spark, Google Cloud Storage and Big Query
On-premise and cloud working together to deliver a data product
Published in TDS Archive, Mar 6, 2023

Hands-On Introduction to Delta Lake with (py)Spark
Concepts, theory, and functionalities of this modern data storage framework
Published in TDS Archive, Feb 16, 2023

Temporal and Geo-referenced Traffic Management with Python+Streamlit
Applying modern tools to visualize time and spatial data in a dashboard
Jan 29, 2023

First Steps in Machine Learning with Apache Spark
Basic concepts and topics of Spark MLlib package
Published in TDS Archive, Jan 4, 2023

A Fast Look at Spark Structured Streaming + Kafka
Learning the basics of how to use this powerful duo for stream-processing tasks
Published in TDS Archive, Nov 5, 2022