Download or read book Getting Started with DuckDB written by Simon Aubury and published by Packt Publishing Ltd. This book was released on 2024-06-24 with total page 382 pages. Available in PDF, EPUB and Kindle. Book excerpt: Analyze and transform data efficiently with DuckDB, a versatile, modern, in-process SQL database Key Features Use DuckDB to rapidly load, transform, and query data across a range of sources and formats Gain practical experience using SQL, Python, and R to effectively analyze data Learn how open source tools and cloud services in the broader data ecosystem complement DuckDB’s versatile capabilities Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionDuckDB is a fast in-process analytical database. Getting Started with DuckDB offers a practical overview of its usage. You'll learn to load, transform, and query various data formats, including CSV, JSON, and Parquet. The book covers DuckDB's optimizations, SQL enhancements, and extensions for specialized applications. Working with examples in SQL, Python, and R, you'll explore analyzing public datasets and discover tools enhancing DuckDB workflows. This guide suits both experienced and new data practitioners, quickly equipping you to apply DuckDB's capabilities in analytical projects. You'll gain proficiency in using DuckDB for diverse tasks, enabling effective integration into your data workflows.What you will learn Understand the properties and applications of a columnar in-process database Use SQL to load, transform, and query a range of data formats Discover DuckDB's rich extensions and learn how to apply them Use nested data types to model semi-structured data and extract and model JSON data Integrate DuckDB into your Python and R analytical workflows Effectively leverage DuckDB's convenient SQL enhancements Explore the wider ecosystem and pathways for building DuckDB-powered data applications Who this book is for If you’re interested in expanding your analytical toolkit, this book is for you. It will be particularly valuable for data analysts wanting to rapidly explore and query complex data, data and software engineers looking for a lean and versatile data processing tool, along with data scientists needing a scalable data manipulation library that integrates seamlessly with Python and R. You will get the most from this book if you have some familiarity with SQL and foundational database concepts, as well as exposure to a programming language such as Python or R.
Download or read book DuckDB in Action written by Mark Needham and published by Simon and Schuster. This book was released on 2024-09-10 with total page 310 pages. Available in PDF, EPUB and Kindle. Book excerpt: Dive into DuckDB and start processing gigabytes of data with ease—all with no data warehouse. DuckDB is a cutting-edge SQL database that makes it incredibly easy to analyze big data sets right from your laptop. In DuckDB in Action you’ll learn everything you need to know to get the most out of this awesome tool, keep your data secure on prem, and save you hundreds on your cloud bill. From data ingestion to advanced data pipelines, you’ll learn everything you need to get the most out of DuckDB—all through hands-on examples. Open up DuckDB in Action and learn how to: • Read and process data from CSV, JSON and Parquet sources both locally and remote • Write analytical SQL queries, including aggregations, common table expressions, window functions, special types of joins, and pivot tables • Use DuckDB from Python, both with SQL and its "Relational"-API, interacting with databases but also data frames • Prepare, ingest and query large datasets • Build cloud data pipelines • Extend DuckDB with custom functionality Pragmatic and comprehensive, DuckDB in Action introduces the DuckDB database and shows you how to use it to solve common data workflow problems. You won’t need to read through pages of documentation—you’ll learn as you work. Get to grips with DuckDB's unique SQL dialect, learning to seamlessly load, prepare, and analyze data using SQL queries. Extend DuckDB with both Python and built-in tools such as MotherDuck, and gain practical insights into building robust and automated data pipelines. About the technology DuckDB makes data analytics fast and fun! You don’t need to set up a Spark or run a cloud data warehouse just to process a few hundred gigabytes of data. DuckDB is easily embeddable in any data analytics application, runs on a laptop, and processes data from almost any source, including JSON, CSV, Parquet, SQLite and Postgres. About the book DuckDB in Action guides you example-by-example from setup, through your first SQL query, to advanced topics like building data pipelines and embedding DuckDB as a local data store for a Streamlit web app. You’ll explore DuckDB’s handy SQL extensions, get to grips with aggregation, analysis, and data without persistence, and use Python to customize DuckDB. A hands-on project accompanies each new topic, so you can see DuckDB in action. What's inside • Prepare, ingest and query large datasets • Build cloud data pipelines • Extend DuckDB with custom functionality • Fast-paced SQL recap: From simple queries to advanced analytics About the reader For data pros comfortable with Python and CLI tools. About the author Mark Needham is a blogger and video creator at @?LearnDataWithMark. Michael Hunger leads product innovation for the Neo4j graph database. Michael Simons is a Java Champion, author, and Engineer at Neo4j.
Download or read book In Memory Analytics with Apache Arrow written by Matthew Topol and published by Packt Publishing Ltd. This book was released on 2024-09-30 with total page 406 pages. Available in PDF, EPUB and Kindle. Book excerpt: Harness the power of Apache Arrow to optimize tabular data processing and develop robust, high-performance data systems with its standardized, language-independent columnar memory format Key Features Explore Apache Arrow's data types and integration with pandas, Polars, and Parquet Work with Arrow libraries such as Flight SQL, Acero compute engine, and Dataset APIs for tabular data Enhance and accelerate machine learning data pipelines using Apache Arrow and its subprojects Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionApache Arrow is an open source, columnar in-memory data format designed for efficient data processing and analytics. This book harnesses the author’s 15 years of experience to show you a standardized way to work with tabular data across various programming languages and environments, enabling high-performance data processing and exchange. This updated second edition gives you an overview of the Arrow format, highlighting its versatility and benefits through real-world use cases. It guides you through enhancing data science workflows, optimizing performance with Apache Parquet and Spark, and ensuring seamless data translation. You’ll explore data interchange and storage formats, and Arrow's relationships with Parquet, Protocol Buffers, FlatBuffers, JSON, and CSV. You’ll also discover Apache Arrow subprojects, including Flight, SQL, Database Connectivity, and nanoarrow. You’ll learn to streamline machine learning workflows, use Arrow Dataset APIs, and integrate with popular analytical data systems such as Snowflake, Dremio, and DuckDB. The latter chapters provide real-world examples and case studies of products powered by Apache Arrow, providing practical insights into its applications. By the end of this book, you’ll have all the building blocks to create efficient and powerful analytical services and utilities with Apache Arrow.What you will learn Use Apache Arrow libraries to access data files, both locally and in the cloud Understand the zero-copy elements of the Apache Arrow format Improve the read performance of data pipelines by memory-mapping Arrow files Produce and consume Apache Arrow data efficiently by sharing memory with the C API Leverage the Arrow compute engine, Acero, to perform complex operations Create Arrow Flight servers and clients for transferring data quickly Build the Arrow libraries locally and contribute to the community Who this book is for This book is for developers, data engineers, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. Whether you’re building utilities for data analytics and query engines, or building full pipelines with tabular data, this book can help you out regardless of your preferred programming language. A basic understanding of data analysis concepts is needed, but not necessary. Code examples are provided using C++, Python, and Go throughout the book.
Download or read book Elastic Stack 8 x Cookbook written by Huage Chen and published by Packt Publishing Ltd. This book was released on 2024-06-28 with total page 688 pages. Available in PDF, EPUB and Kindle. Book excerpt: Unlock the full potential of Elastic Stack for search, analytics, security, and observability and manage substantial data workloads in both on-premise and cloud environments Key Features Explore the diverse capabilities of the Elastic Stack through a comprehensive set of recipes Build search applications, analyze your data, and observe cloud-native applications Harness powerful machine learning and AI features to create data science and search applications Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionLearn how to make the most of the Elastic Stack (ELK Stack) products—including Elasticsearch, Kibana, Elastic Agent, and Logstash—to take data reliably and securely from any source, in any format, and then search, analyze, and visualize it in real-time. This cookbook takes a practical approach to unlocking the full potential of Elastic Stack through detailed recipes step by step. Starting with installing and ingesting data using Elastic Agent and Beats, this book guides you through data transformation and enrichment with various Elastic components and explores the latest advancements in search applications, including semantic search and Generative AI. You'll then visualize and explore your data and create dashboards using Kibana. As you progress, you'll advance your skills with machine learning for data science, get to grips with natural language processing, and discover the power of vector search. The book covers Elastic Observability use cases for log, infrastructure, and synthetics monitoring, along with essential strategies for securing the Elastic Stack. Finally, you'll gain expertise in Elastic Stack operations to effectively monitor and manage your system.What you will learn Discover techniques for collecting data from diverse sources Visualize data and create dashboards using Kibana to extract business insights Explore machine learning, vector search, and AI capabilities of Elastic Stack Handle data transformation and data formatting Build search solutions from the ingested data Leverage data science tools for in-depth data exploration Monitor and manage your system with Elastic Stack Who this book is for This book is for Elastic Stack users, developers, observability practitioners, and data professionals ranging from beginner to expert level. If you’re a developer, you’ll benefit from the easy-to-follow recipes for using APIs and features to build powerful applications, and if you’re an observability practitioner, this book will help you with use cases covering APM, Kubernetes, and cloud monitoring. For data engineers and AI enthusiasts, the book covers dedicated recipes on vector search and machine learning. No prior knowledge of the Elastic Stack is required.
Download or read book Polars Cookbook written by Yuki Kakegawa and published by Packt Publishing Ltd. This book was released on 2024-08-23 with total page 394 pages. Available in PDF, EPUB and Kindle. Book excerpt: Leverage Polars, a lightning-fast DataFrame library, to transform your Python-based data science projects with efficient data wrangling and manipulation Key Features Unlock the power of Python Polars for faster and more efficient data analysis workflows Master the fundamentals of Python Polars with step-by-step recipes Discover data manipulation techniques to apply across multiple data problems Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionThe Polars Cookbook is a comprehensive, hands-on guide to Python Polars, one of the first resources dedicated to this powerful data processing library. Written by Yuki Kakegawa, a seasoned data analytics consultant who has worked with industry leaders like Microsoft and Stanford Health Care, this book offers targeted, real-world solutions to data processing, manipulation, and analysis challenges. The book also includes a foreword by Marco Gorelli, a core contributor to Polars, ensuring expert insights into Polars' applications. From installation to advanced data operations, you’ll be guided through data manipulation, advanced querying, and performance optimization techniques. You’ll learn to work with large datasets, conduct sophisticated transformations, leverage powerful features like chaining, and understand its caveats. This book also shows you how to integrate Polars with other Python libraries such as pandas, numpy, and PyArrow, and explore deployment strategies for both on-premises and cloud environments like AWS, BigQuery, GCS, Snowflake, and S3. With use cases spanning data engineering, time series analysis, statistical analysis, and machine learning, Polars Cookbook provides essential techniques for optimizing and securing your workflows. By the end of this book, you'll possess the skills to design scalable, efficient, and reliable data processing solutions with Polars. What you will learn Read from different data sources and write to various files and databases Apply aggregations, window functions, and string manipulations Perform common data tasks such as handling missing values and performing list and array operations Discover how to reshape and tidy your data by pivoting, joining, and concatenating Analyze your time series data in Python Polars Create better workflows with testing and debugging Who this book is for This book is for data analysts, data scientists, and data engineers who want to learn how to use Polars in their workflows. Working knowledge of the Python programming language is required. Experience working with a DataFrame library such as pandas or PySpark will also be helpful.
Download or read book Amazon DynamoDB The Definitive Guide written by Aman Dhingra and published by Packt Publishing Ltd. This book was released on 2024-08-30 with total page 415 pages. Available in PDF, EPUB and Kindle. Book excerpt: Harness the potential and scalability of DynamoDB to effortlessly construct resilient, low-latency databases Key Features Discover how DynamoDB works behind the scenes to make the most of its features Learn how to keep latency and costs minimal even when scaling up Integrate DynamoDB with other AWS services to create a full data analytics system Purchase of the print or Kindle book includes a free PDF eBook Book DescriptionThis book will help you master Amazon DynamoDB, the fully managed, serverless, NoSQL database service designed for high performance at any scale. Authored by Aman Dhingra, senior DynamoDB specialist solutions architect at AWS, and Mike Mackay, former senior NoSQL specialist solutions architect at AWS, this guide draws on their expertise to equip you with the knowledge and skills needed to harness DynamoDB's full potential. This book not only introduces you to DynamoDB's core features and real-world applications, but also provides in-depth guidance on transitioning from traditional relational databases to the NoSQL world. You'll learn essential data modeling techniques, such as vertical partitioning, and explore the nuances of DynamoDB's indexing capabilities, capacity modes, and consistency models. The chapters also help you gain a solid understanding of advanced topics such as enhanced analytical patterns, implementing caching with DynamoDB Accelerator (DAX), and integrating DynamoDB with other AWS services to optimize your data strategies. By the end of this book, you’ll be able to design, build, and deliver low-latency, high-throughput DynamoDB solutions, driving new levels of efficiency and performance for your applications.What you will learn Master key-value data modeling in DynamoDB for efficiency Transition from RDBMSs to NoSQL with optimized strategies Implement read consistency and ACID transactions effectively Explore vertical partitioning for specific data access patterns Optimize data retrieval using secondary indexes in DynamoDB Manage capacity modes, backup strategies, and core components Enhance DynamoDB with caching, analytics, and global tables Evaluate and design your DynamoDB migration strategy Who this book is for This book is for software architects designing scalable systems, developers optimizing performance with DynamoDB, and engineering managers guiding decision-making. Data engineers will learn to integrate DynamoDB into workflows, while product owners will explore its innovative capabilities. DBAs transitioning to NoSQL will find valuable insights on DynamoDB and RDBMS integration. Basic knowledge of software engineering, Python, and cloud computing is helpful. Hands-on AWS or DynamoDB experience is beneficial but not required.
Download or read book DevOps for Data Science written by Alex Gold and published by CRC Press. This book was released on 2024-06-19 with total page 274 pages. Available in PDF, EPUB and Kindle. Book excerpt: Data Scientists are experts at analyzing, modelling and visualizing data but, at one point or another, have all encountered difficulties in collaborating with or delivering their work to the people and systems that matter. Born out of the agile software movement, DevOps is a set of practices, principles and tools that help software engineers reliably deploy work to production. This book takes the lessons of DevOps and aplies them to creating and delivering production-grade data science projects in Python and R. This book’s first section explores how to build data science projects that deploy to production with no frills or fuss. Its second section covers the rudiments of administering a server, including Linux, application, and network administration before concluding with a demystification of the concerns of enterprise IT/Administration in its final section, making it possible for data scientists to communicate and collaborate with their organization’s security, networking, and administration teams. Key Features: • Start-to-finish labs take readers through creating projects that meet DevOps best practices and creating a server-based environment to work on and deploy them. • Provides an appendix of cheatsheets so that readers will never be without the reference they need to remember a Git, Docker, or Command Line command. • Distills what a data scientist needs to know about Docker, APIs, CI/CD, Linux, DNS, SSL, HTTP, Auth, and more. • Written specifically to address the concern of a data scientist who wants to take their Python or R work to production. There are countless books on creating data science work that is correct. This book, on the otherhand, aims to go beyond this, targeted at data scientists who want their work to be than merely accurate and deliver work that matters.
Download or read book Analytics Engineering with SQL and dbt written by Rui Pedro Machado and published by "O'Reilly Media, Inc.". This book was released on 2023-12-08 with total page 324 pages. Available in PDF, EPUB and Kindle. Book excerpt: With the shift from data warehouses to data lakes, data now lands in repositories before it's been transformed, enabling engineers to model raw data into clean, well-defined datasets. dbt (data build tool) helps you take data further. This practical book shows data analysts, data engineers, BI developers, and data scientists how to create a true self-service transformation platform through the use of dynamic SQL. Authors Rui Machado from Monstarlab and Hélder Russa from Jumia show you how to quickly deliver new data products by focusing more on value delivery and less on architectural and engineering aspects. If you know your business well and have the technical skills to model raw data into clean, well-defined datasets, you'll learn how to design and deliver data models without any technical influence. With this book, you'll learn: What dbt is and how a dbt project is structured How dbt fits into the data engineering and analytics worlds How to collaborate on building data models The main tools and architectures for building useful, functional data models How to fit dbt into data warehousing and laking architecture How to build tests for data transformations
Download or read book Analyzing Baseball Data with R written by Jim Albert and published by CRC Press. This book was released on 2024-08-01 with total page 418 pages. Available in PDF, EPUB and Kindle. Book excerpt: “Our community has continued to grow exponentially, thanks to those who inspire the next generation. And inspiring the next generation is what the authors of Analyzing Baseball Data with R are doing. They are setting the career path for still thousands more. We all need some sort of kickstart to take that first or second step. You may be a beginner R coder, but you need access to baseball data. How do you access this data, how do you manipulate it, how do you analyze it? This is what this book does for you. But it does more, by doing what sabermetrics does best: it asks baseball questions. Throughout the book, baseball questions are asked, some straightforward, and others more thought-provoking.” From the Foreword by Tom Tango Analyzing Baseball Data with R Third Edition introduces R to sabermetricians, baseball enthusiasts, and students interested in exploring the richness of baseball data. It equips you with the necessary skills and software tools to perform all the analysis steps, from importing the data to transforming them into an appropriate format to visualizing the data via graphs to performing a statistical analysis. The authors first present an overview of publicly available baseball datasets and a gentle introduction to the type of data structures and exploratory and data management capabilities of R. They also cover the ggplot2 graphics functions and employ a tidyverse-friendly workflow throughout. Much of the book illustrates the use of R through popular sabermetrics topics, including the Pythagorean formula, runs expectancy, catcher framing, career trajectories, simulation of games and seasons, patterns of streaky behavior of players, and launch angles and exit velocities. All the datasets and R code used in the text are available for download online. New to the third edition is the revised R code to make use of new functions made available through the tidyverse. The third edition introduces three chapters of new material, focusing on communicating results via presentations using the Quarto publishing system, web applications using the Shiny package, and working with large data files. An online version of this book is hosted at https://beanumber.github.io/abdwr3e/.
Download or read book R for Data Science written by Hadley Wickham and published by "O'Reilly Media, Inc.". This book was released on 2016-12-12 with total page 521 pages. Available in PDF, EPUB and Kindle. Book excerpt: Learn how to use R to turn raw data into insight, knowledge, and understanding. This book introduces you to R, RStudio, and the tidyverse, a collection of R packages designed to work together to make data science fast, fluent, and fun. Suitable for readers with no previous programming experience, R for Data Science is designed to get you doing data science as quickly as possible. Authors Hadley Wickham and Garrett Grolemund guide you through the steps of importing, wrangling, exploring, and modeling your data and communicating the results. You'll get a complete, big-picture understanding of the data science cycle, along with basic tools you need to manage the details. Each section of the book is paired with exercises to help you practice what you've learned along the way. You'll learn how to: Wrangle—transform your datasets into a form convenient for analysis Program—learn powerful R tools for solving data problems with greater clarity and ease Explore—examine your data, generate hypotheses, and quickly test them Model—provide a low-dimensional summary that captures true "signals" in your dataset Communicate—learn R Markdown for integrating prose, code, and results
Download or read book Streaming Databases written by Hubert Dulay and published by "O'Reilly Media, Inc.". This book was released on 2024-08-08 with total page 260 pages. Available in PDF, EPUB and Kindle. Book excerpt: Real-time applications are becoming the norm today. But building a model that works properly requires real-time data from the source, in-flight stream processing, and low latency serving of its analytics. With this practical book, data engineers, data architects, and data analysts will learn how to use streaming databases to build real-time solutions. Authors Hubert Dulay and Ralph M. Debusmann take you through streaming database fundamentals, including how these databases reduce infrastructure for real-time solutions. You'll learn the difference between streaming databases, stream processing, and real-time online analytical processing (OLAP) databases. And you'll discover when to use push queries versus pull queries, and how to serve synchronous and asynchronous data emanating from streaming databases. This guide helps you: Explore stream processing and streaming databases Learn how to build a real-time solution with a streaming database Understand how to construct materialized views from any number of streams Learn how to serve synchronous and asynchronous data Get started building low-complexity streaming solutions with minimal setup
Download or read book R for Data Science written by Hadley Wickham and published by "O'Reilly Media, Inc.". This book was released on 2023-06-08 with total page 579 pages. Available in PDF, EPUB and Kindle. Book excerpt: Use R to turn data into insight, knowledge, and understanding. With this practical book, aspiring data scientists will learn how to do data science with R and RStudio, along with the tidyverse—a collection of R packages designed to work together to make data science fast, fluent, and fun. Even if you have no programming experience, this updated edition will have you doing data science quickly. You'll learn how to import, transform, and visualize your data and communicate the results. And you'll get a complete, big-picture understanding of the data science cycle and the basic tools you need to manage the details. Updated for the latest tidyverse features and best practices, new chapters show you how to get data from spreadsheets, databases, and websites. Exercises help you practice what you've learned along the way. You'll understand how to: Visualize: Create plots for data exploration and communication of results Transform: Discover variable types and the tools to work with them Import: Get data into R and in a form convenient for analysis Program: Learn R tools for solving data problems with greater clarity and ease Communicate: Integrate prose, code, and results with Quarto
Download or read book Learning Spark written by Jules S. Damji and published by O'Reilly Media. This book was released on 2020-07-16 with total page 400 pages. Available in PDF, EPUB and Kindle. Book excerpt: Data is bigger, arrives faster, and comes in a variety of formats—and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you’ll be able to: Learn Python, SQL, Scala, or Java high-level Structured APIs Understand Spark operations and SQL Engine Inspect, tune, and debug Spark operations with Spark configurations and Spark UI Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka Perform analytics on batch and streaming data using Structured Streaming Build reliable data pipelines with open source Delta Lake and Spark Develop machine learning pipelines with MLlib and productionize models using MLflow
Download or read book Advances in Information Retrieval written by Joemon M. Jose and published by Springer Nature. This book was released on 2020-04-10 with total page 709 pages. Available in PDF, EPUB and Kindle. Book excerpt: This two-volume set LNCS 12035 and 12036 constitutes the refereed proceedings of the 42nd European Conference on IR Research, ECIR 2020, held in Lisbon, Portugal, in April 2020.* The 55 full papers presented together with 8 reproducibility papers, 46 short papers, 10 demonstration papers, 12 invited CLEF papers, 7 doctoral consortium papers, 4 workshop papers, and 3 tutorials were carefully reviewed and selected from 457 submissions. They were organized in topical sections named: Part I: deep learning I; entities; evaluation; recommendation; information extraction; deep learning II; retrieval; multimedia; deep learning III; queries; IR – general; question answering, prediction, and bias; and deep learning IV. Part II: reproducibility papers; short papers; demonstration papers; CLEF organizers lab track; doctoral consortium papers; workshops; and tutorials. *Due to the COVID-19 pandemic, this conference was held virtually.
Download or read book Learning PySpark written by Tomasz Drabas and published by Packt Publishing Ltd. This book was released on 2017-02-27 with total page 273 pages. Available in PDF, EPUB and Kindle. Book excerpt: Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0 About This Book Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0 Develop and deploy efficient, scalable real-time Spark solutions Take your understanding of using Spark with Python to the next level with this jump start guide Who This Book Is For If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory. What You Will Learn Learn about Apache Spark and the Spark 2.0 architecture Build and interact with Spark DataFrames using Spark SQL Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames respectively Read, transform, and understand data and use it to train machine learning models Build machine learning models with MLlib and ML Learn how to submit your applications programmatically using spark-submit Deploy locally built applications to a cluster In Detail Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications. Style and approach This book takes a very comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in a very easy-to-understand manner, with a focus on both the hows and the whys of each concept.
Download or read book XML Web Services and the Data Revolution written by Frank P. Coyle and published by Addison-Wesley Professional. This book was released on 2002 with total page 392 pages. Available in PDF, EPUB and Kindle. Book excerpt: This invaluable guide places XML in context, discussing why it is so significant, and how it affects the business and computing worlds, most recently with the emergence of Web services. It also explores the full ranges of XML related technologies.
Download or read book The Art of SQL written by Stephane Faroult and published by "O'Reilly Media, Inc.". This book was released on 2006-03-10 with total page 369 pages. Available in PDF, EPUB and Kindle. Book excerpt: For all the buzz about trendy IT techniques, data processing is still at the core of our systems, especially now that enterprises all over the world are confronted with exploding volumes of data. Database performance has become a major headache, and most IT departments believe that developers should provide simple SQL code to solve immediate problems and let DBAs tune any bad SQL later. In The Art of SQL, author and SQL expert Stephane Faroult argues that this safe approach only leads to disaster. His insightful book, named after Art of War by Sun Tzu, contends that writing quick inefficient code is sweeping the dirt under the rug. SQL code may run for 5 to 10 years, surviving several major releases of the database management system and on several generations of hardware. The code must be fast and sound from the start, and that requires a firm understanding of SQL and relational theory. The Art of SQL offers best practices that teach experienced SQL users to focus on strategy rather than specifics. Faroult's approach takes a page from Sun Tzu's classic treatise by viewing database design as a military campaign. You need knowledge, skills, and talent. Talent can't be taught, but every strategist from Sun Tzu to modern-day generals believed that it can be nurtured through the experience of others. They passed on their experience acquired in the field through basic principles that served as guiding stars amid the sound and fury of battle. This is what Faroult does with SQL. Like a successful battle plan, good architectural choices are based on contingencies. What if the volume of this or that table increases unexpectedly? What if, following a merger, the number of users doubles? What if you want to keep several years of data online? Faroult's way of looking at SQL performance may be unconventional and unique, but he's deadly serious about writing good SQL and using SQL well. The Art of SQL is not a cookbook, listing problems and giving recipes. The aim is to get you-and your manager-to raise good questions.