DZone 2018 Guide to Big Data (English edition), 2018, 38 pages

Dear Reader,
I first heard the term “Big Data” almost a decade ago. At that time, it
looked like it was nothing new, and our databases would just be
upgraded to handle some more data. No big deal. But soon, it became
clear that traditional databases were not designed to handle Big Data.
The term “Big Data” has more dimensions than just “some more data.”
It encompasses both structured and unstructured data, fast-moving
and historical data. With these elements added to the data, other
problems such as data contextualization, data validity, noise, and
abnormality in the data became more prominent.
Since then, Big Data technologies have gone through several phases of
development and transformation, and they are gradually maturing. A
term that was considered a fad and a technology ecosystem that was
considered a luxury are slowly establishing themselves as necessities
for today’s business activities. Big Data is the new competitive
advantage, and it matters for our businesses.
The more we progress and the more automation we implement, the more
data is going to transform and grow. Blockchain technologies, Cloud,
and IoT are adding new dimensions to the Big Data trend.
Hats off to the developers who are continually innovating and creating
new Big Data Storage and Analytics applications to derive value out of
this data. The fast-paced development has made it easier for us to tame
fast-growing massive data and integrate our existing Enterprise IT
infrastructure with these new data sources. These successes are driven
by both Enterprises and Open Source communities. Without open source
projects like Apache Hadoop, Apache Spark, and Kafka, to name a few, the
landscape would have been entirely different. The use of Machine
Learning and Data visualization methods packaged for analyzing Big Data
is also making life easier for analysts and management. However, we
still hear of the failure of analytics projects more often than of
their successes. There are several reasons why. So, we bring you this
guide, where articles written by DZone contributors provide more
significant insights into these topics.
The Big Data guide is an attempt to help readers discover and
understand the current landscape of the Big Data ecosystem, where we
stand, and what amazing insights and applications people are
discovering in this space. We hope that everyone who reads this guide
finds it worthwhile and informative. Happy reading!
By Sibanjan Das
BUSINESS ANALYTICS & DATA SCIENCE CONSULTANT & DZONE ZONE LEADER
Executive Summary
BY MATT WERNER ____ 3
Key Research Findings
BY G. RYAN SPAIN ____ 4
Take Big Data to the Next Level with Blockchain Networks
BY ARJUNA CHALA ____ 6
Solving Data Integration at Stitch Fix
BY LIZ BENNETT ____ 10
Checklist: Ten Tips for Ensuring Your Next Data Analytics Project is a Success
BY WOLF RUZICKA ____ 13
Infographic: Big Data Realization with Sanitation ____ 14
Why Developers Should Bet Big on Streaming
BY JONAS BONÉR ____ 16
Introduction to Basic Statistics Measurements
BY SUNIL KAPPAL ____ 20
Diving Deeper into Big Data ____ 23
Executive Insights on the State of Big Data
BY TOM SMITH ____ 24
Big Data Solutions Directory ____ 26
Glossary ____ 36
DZONE IS...

PRODUCTION
CHRIS SMITH, DIR. OF PRODUCTION
ANDRE POWELL, SR. PRODUCTION COORDINATOR
G. RYAN SPAIN, PRODUCTION COORDINATOR
ASHLEY SLATE, DESIGN DIR.
BILLY DAVIS, PRODUCTION ASSISTANT

MARKETING
KELLET ATKINSON, DIR. OF MARKETING
LAUREN CURATOLA, MARKETING SPECIALIST
KRISTEN PAGÀN, MARKETING SPECIALIST
NATALIE IANNELLO, MARKETING SPECIALIST
JULIAN MORRIS, MARKETING SPECIALIST

BUSINESS
RICK ROSS, CEO
MATT SCHMIDT, PRESIDENT
JESSE DAVIS, EVP

SALES
MATT O’BRIAN, DIR. OF BUSINESS DEV.
ALEX CRAFTS, DIR. OF MAJOR ACCOUNTS
JIM HOWARD, SR ACCOUNT EXECUTIVE
JIM DYER, ACCOUNT EXECUTIVE
ANDREW BARKER, ACCOUNT EXECUTIVE
BRIAN ANDERSON, ACCOUNT EXECUTIVE
RYAN McCOOK, ACCOUNT EXECUTIVE
CHRIS BRUMFIELD, SALES MANAGER
TOM MARTIN, ACCOUNT MANAGER
JASON BUDDAY, ACCOUNT MANAGER

EDITORIAL
CAITLIN CANDELMO, DIR. OF CONTENT & COMMUNITY
MATT WERNER, PUBLICATIONS COORD.
MICHAEL THARRINGTON, CONTENT & COMMUNITY MANAGER
KARA PHELPS, CONTENT & COMMUNITY MANAGER
MIKE GATES, SR. CONTENT COORD.
SARAH DAVIS, CONTENT COORD.
TOM SMITH, RESEARCH ANALYST
JORDAN BAKER, CONTENT COORD.
ANNE MARIE GLEN, CONTENT COORD.
ANDRE LEE-MOYE, CONTENT COORD.
Table of Contents
THE DZONE GUIDE TO BIG DATA: STREAM PROCESSING, STATISTICS, AND SCALABILITY
DZONE.COM/GUIDES
Classically, Big Data has been defined by three V’s: Volume, or how
much data you have; Velocity, or how fast data is collected; and
Variety, or how heterogeneous the data set is. As movements like the
Internet of Things provide constant streams of data from hardware,
and AI initiatives require massive data sets to teach machines to
think, the way in which Big Data is stored and utilized continues
to change. To find out how developers are approaching these
challenges, we asked 540 DZone members to tell us about what
tools they’re using to overcome them, and how their organizations
measure successful implementations.
THE PAINS OF THE THREE V’S
DATA
Of several data sources, files give developers the most trouble
when it comes to the volume and variety of data (47% and 56%,
respectively), while 42% of respondents had major issues with the
speed at which both server logs and sensor data were generated.
The volume of server logs was also a major issue, with 46% of
respondents citing it as a pain point.
IMPLICATIONS
As the Internet of Things takes more of a foothold in various industries,
the difficulties in handling all three V’s of Big Data will increase. More
and more applications and organizations are generating files and
documents rather than just values and numbers.
RECOMMENDATIONS
A good way to deal with the constant influx of data from remote
hardware is to use solutions like Apache Kafka, an open source
project built to handle the processing of data in real-time as it is
collected. Currently, 61% of survey respondents are using Kafka, and
we recently released our first Refcard on the topic. Document store
databases, such as MongoDB, are recommended for handling files,
documents, and other semi-structured data.
DATA IN THE CLOUD
DATA
39% of survey respondents typically store data in the cloud, compared
to 33% who store it on-premise and 23% who take a hybrid approach.
Of those using the cloud or hybrid solutions, Amazon was by far the
most popular vendor (70%) followed by Google Cloud (57%) and
Microsoft Azure (39%).
IMPLICATIONS
The percentage of respondents using cloud solutions increased by
8% in 2018, while on-premise and hybrid storage decreased by 6%
and 4%, respectively. As cloud solutions become more and more
ubiquitous, the prospect of storing data on external hardware
becomes more appealing as a way to decrease costs. Another reason
for this increase may be the ability of some cloud tools, such as
AWS Kinesis, to process data.
RECOMMENDATIONS
Not every organization needs a way to store Big Data, particularly if
it does not yet have a strong use case for it. Once a business strategy
around Big Data is created, however, a cloud storage solution requires
less up-front investment for smaller enterprises. If your business
deals in sensitive information, a hybrid solution would be a good
compromise. If you need insights as fast as possible, you may need to
invest in an on-premise solution to access data quickly.
BLINDED WITH DATA SCIENCE
DATA
The biggest challenges in data science are working with unsanitized
or unclean data (64%), working within timelines (40%), and limited
training and talent (39%).
IMPLICATIONS
The return on investment of analytics projects is mostly focused on
the speed of decision-making (47%), the speed of data access (45%),
and data integrity (31%). These KPIs all contribute to the difficulty
of dealing with unclean data and project timelines. Insights are
demanded quickly, but data scientists need to take time to ensure
data quality is good so those insights are valuable.
RECOMMENDATIONS
Project timelines need to be able to accommodate the time it takes
to prepare data for analysis, which can range from omitting sensitive
information to deleting irrelevant values. One easy way to do this is to
sanitize user inputs, keeping users from adding too much nonsense to
your data. Our infographic on page 14 gives a fun explanation of why
unclean and unsanitized data can be such a hassle and why clean data
is important to gleaning insights.
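As a rough illustration of the recommendation above, the following Python sketch sanitizes incoming records before they are stored for analysis. It is only a minimal example under stated assumptions: the field names (`user_id`, `age`, `ssn`, `password`), the age range, and the function name `sanitize_record` are hypothetical and do not come from the survey or from any particular tool.

```python
from typing import Optional

# Hypothetical field names, chosen only for this illustration.
SENSITIVE_FIELDS = {"ssn", "password"}
REQUIRED_FIELDS = {"user_id", "age"}


def sanitize_record(raw: dict) -> Optional[dict]:
    """Return a cleaned copy of raw, or None if the record is unusable."""
    cleaned = {}
    for key, value in raw.items():
        if key in SENSITIVE_FIELDS:
            continue  # omit sensitive information
        if isinstance(value, str):
            value = value.strip()
            if not value:
                continue  # drop empty or whitespace-only strings
        cleaned[key] = value

    # Reject records that lack a required field after cleaning.
    if not REQUIRED_FIELDS.issubset(cleaned):
        return None

    # Reject values that are not numbers or fall outside an assumed range.
    try:
        age = int(cleaned["age"])
    except (TypeError, ValueError):
        return None
    if not 0 <= age <= 130:
        return None
    cleaned["age"] = age
    return cleaned


records = [
    {"user_id": "u1", "age": " 34 ", "ssn": "000-00-0000"},
    {"user_id": "u2", "age": "abc"},   # non-numeric age
    {"user_id": "  ", "age": "20"},    # blank identifier
]
clean = [r for r in (sanitize_record(x) for x in records) if r is not None]
```

Only the first record survives here, with the sensitive field removed and the age converted to an integer. Rejecting bad input at the point of entry is usually cheaper than repairing it later in the analysis pipeline.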
Executive Summary
BY MATT WERNER, PUBLICATIONS COORDINATOR, DZONE
DEMOGRAPHICS
540 software professionals completed DZone’s 2018 Big Data survey.
Respondent demographics are as follows:
42% of respondents identify as developers or engineers, and 23%
identify as developer team leads.
54% of respondents have 10 years of experience or more; 28% have
15 years or more.
39% of respondents work at companies headquartered in Europe;
30% work in companies headquartered in North America.
23% of respondents work at organizations with more than 10,000
employees; 22% work at organizations between 500 and 10,000
employees.
77% develop web applications or services; 48% develop enterprise
business apps; and 21% develop native mobile applications.
78% work at companies using the Java ecosystem; 62% at companies
that use JavaScript; and 37% at companies that use Python.
56% of respondents use Java as their primary language at work.
PYTHON
Python has been moving slowly towards the title of “most popular
language for data science” for years now, and the language to beat has
been R. R has been extraordinarily popular for data-heavy programming
for some time as an open source implementation of S, a language
specifically designed for statistical analysis. And while R still maintains
popularity (in the TIOBE index, it moved from the 16th place ranking
in January 2017 to the 8th place ranking in 2018), Python’s use in data
science and data mining projects has been steadily increasing. Last
year, respondents to DZone’s Big Data survey revealed that Python had
overtaken R as the predominant language used for data science, though
its lead over R was statistically insignificant, and therefore Python
didn’t quite make it to “champion status.” This was mentioned in last year’s
research findings as being consistent with trends in other available
research on Python and R’s use in data science: R is still popular for
data/statistical analysis, but Python has been catching up.
This year, DZone’s Big Data survey showed a significant difference
between the use of R and Python for data science projects: R usage
decreased by 10 percentage points, from 60% to 50%, among survey
respondents in the last year, while Python usage increased by 6
percentage points, from 64% to 70%. This puts Python’s share of
respondents 20 percentage points ahead of R’s this year. While Python
was not created specifically for data analysis, its dynamic typing,
easy-to-learn syntax, and ever-increasing base of libraries have made it
an ideal candidate for developers to start delving into data science and
analysis more comfortably than they may have been able to in the past.
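The survey figures quoted above (R from 60% to 50%, Python from 64% to 70%) can be read either as changes in percentage points or as relative changes, and the two readings give different numbers. The short Python sketch below works through the arithmetic; the function names are illustrative only, and the inputs are the four percentages from the text.

```python
def point_change(before: float, after: float) -> float:
    """Change in percentage points between two survey shares."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting share, expressed in percent."""
    return (after - before) / before * 100


# Shares of respondents reported in the text, in percent.
r_2017, r_2018 = 60.0, 50.0
py_2017, py_2018 = 64.0, 70.0

r_points = point_change(r_2017, r_2018)          # R fell by 10 points
py_points = point_change(py_2017, py_2018)       # Python rose by 6 points
r_relative = relative_change(r_2017, r_2018)     # about -16.7% relative
py_relative = relative_change(py_2017, py_2018)  # about +9.4% relative

# Gap between Python and R this year.
gap_points = py_2018 - r_2018                      # 20 percentage points
gap_relative = relative_change(r_2018, py_2018)    # 40% more than R's share
```

Read this way, the 10, 6, and 20 in the text are percentage-point differences. In relative terms, Python's share of respondents is 40% larger than R's.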
GRAPH 01. What languages/libraries/frameworks do you use for data science and machine learning?
GRAPH 02. What database management systems do you use in production?
Key Research Findings
BY G. RYAN SPAIN, PRODUCTION COORDINATOR, DZONE