Jack Norris
Wednesday, March 30 at 9:25am
Big data is not limited to reporting and analysis; increasingly the companies that are differentiating themselves are acting on data in real-time. But what does “real time” really mean? This talk will discuss the challenges of coordinating data flows, analysis, and integration at scale to truly impact business as it happens.
Wednesday, March 30 at 11:50am
Application developers and architects today are interested in making their applications as real-time as possible. To make an application respond to events as they happen, developers need a reliable way to move data as it is generated across different systems, one event at a time. In other words, these applications need messaging.
Messaging solutions have existed for a long time. However, when compared to legacy systems, newer solutions like Apache Kafka have higher performance, more scalability, and better integration with the Hadoop ecosystem. Kafka and similar systems are based on drastically different assumptions than legacy systems and have vastly different architectures. But do these benefits outweigh any tradeoffs in functionality? M. C. Srivas dives into the architectural details and tradeoffs of both legacy and new messaging solutions to find the ideal messaging system for Hadoop.
- Queues versus logs
- Security issues like authentication, authorization, and encryption
- Scalability and performance
- Handling applications that span multiple data centers
- Multitenancy considerations
- APIs and integration points, and more
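The "queues versus logs" distinction above is the architectural heart of the comparison. As a rough illustration (not Kafka's actual API; the class and method names here are invented for the sketch), a classic queue delivers each message once and destructively, while a Kafka-style log retains messages and lets every consumer track its own read position independently:

```python
from collections import deque

class Queue:
    """Legacy-style queue: a read removes the message for everyone."""
    def __init__(self):
        self._items = deque()
    def send(self, msg):
        self._items.append(msg)
    def receive(self):
        return self._items.popleft() if self._items else None

class Log:
    """Kafka-style log: messages are retained; each consumer keeps an offset."""
    def __init__(self):
        self._items = []
        self._offsets = {}          # consumer name -> next offset to read
    def send(self, msg):
        self._items.append(msg)
    def receive(self, consumer):
        pos = self._offsets.get(consumer, 0)
        if pos >= len(self._items):
            return None             # caught up
        self._offsets[consumer] = pos + 1
        return self._items[pos]

q, log = Queue(), Log()
for m in ("a", "b"):
    q.send(m)
    log.send(m)

q.receive()                       # "a" is now gone for everyone
print(q.receive())                # -> b
print(log.receive("analytics"))   # -> a  (each consumer reads independently)
print(log.receive("archiver"))    # -> a
```

The log model is what makes multiple independent downstream consumers, replay, and Hadoop-ecosystem integration natural; the tradeoff is that consumers, not the broker, own delivery state.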
Wednesday, March 30 at 4:20pm
SQL is normally a very static language that assumes a fixed, well-known schema and flat data structures (without complex or nested field values). The received wisdom is that these static assumptions are required for performance. Big data systems, however, depend on being flexible and dynamic in order to keep technical debt from growing nonlinearly as they scale. This tension can make it difficult to process important data streams using SQL.
Apache Drill squares this circle by rethinking many of the assumptions that have been built into query systems over the last few decades. By moving much of the optimization and type specificity out of the query parsing and static optimization processes and into the execution process itself, the Drill query engine is able to very efficiently deal with data that has deeper structure and unknown schema. The optimization of the structure of the parallel computation can often be done without much detailed schema information, and detailed optimization with type and structure information can often be done very late in the execution process based on empirically observed schema information. This even allows alternative optimizations as changes in the data structure are observed across a large query.
The only strong assumption that Drill makes a priori is that the data being processed conforms to the JSON data model. There is not even a guarantee that any record has similar characteristics to any other record. Drill can still use such information if it is available early, or it can defer exploitation of such data until it is available. This requires wholesale restructuring of the query parsing, optimization, and execution process.
Ted Dunning walks attendees through Apache Drill, explaining potential use cases for the technology in Drill and why these extended capabilities matter to all big data practitioners.
- How Drill can process data as it exists in the wild without expensive ETL processes
- Why SQL has a strong future in the big data world, especially on NoSQL databases
- How Drill brings together the sophistication and familiarity of SQL with the flexibility of the Hadoop ecosystem
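The late, empirical schema discovery described above can be sketched in miniature (this is an illustration of the idea, not Drill's implementation): no schema is declared up front, and column names and types are simply observed record by record as execution proceeds, even when new fields appear or a field's type changes mid-stream.

```python
import json

# Newline-delimited JSON "in the wild": no declared schema, and the
# records do not all share the same fields or types.
records = [
    '{"name": "alice", "age": 30}',
    '{"name": "bob", "age": 31, "city": "chicago"}',  # a new field appears
    '{"name": "carol", "age": "unknown"}',            # a type changes mid-stream
]

# Discover the schema empirically during the "execution" pass.
schema = {}   # column name -> set of observed type names
for line in records:
    row = json.loads(line)
    for col, val in row.items():
        schema.setdefault(col, set()).add(type(val).__name__)

print(sorted(schema))          # -> ['age', 'city', 'name']
print(sorted(schema["age"]))   # -> ['int', 'str']
```

A query engine built this way can begin planning the parallel computation before it knows any of this, then apply type-specific optimizations only once the shapes are actually observed.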
Wednesday, March 30 at 5:10pm
Until recently, batch processing has been the standard model for big data. Largely, this is due to the strong influence of the original MapReduce processing implementation in Hadoop and the difficulty of replacing MapReduce within the original Hadoop framework.
Today, however, there is a shift to streaming architectures using tools such as Apache Spark and Kafka. These architectures offer large benefits in terms of simplicity and robustness, but they are also surprisingly different from previous message-queuing designs. The changes in these new systems allow enormously higher scalability and make fault tolerance relatively simple to achieve while maintaining good latency.
Ted Dunning explores the key design techniques used in modern systems, including percolators, the big-data oscilloscope, replayable queues, state-point queuing, and universal micro-architectures.
Benefits of these techniques include:
- A decrease in total system complexity
- Flexible throughput/latency tradeoffs
- Fault tolerance without the difficulties of Lambda Architecture
- Easy debuggability
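The "replayable queues" idea above is what makes the fault-tolerance benefit concrete. A minimal sketch (assuming a durable, ordered event log; the function and event names are invented for illustration): a consumer that crashes can rebuild its state simply by replaying events from the start, or from a saved offset, rather than recovering through a separate batch path as the Lambda Architecture requires.

```python
# A durable, ordered event log (in practice, a retained Kafka-style topic).
events = [("deposit", 100), ("deposit", 50), ("withdraw", 30)]

def rebuild_balance(log, from_offset=0):
    """Replay events from an offset to reconstruct in-memory state."""
    balance = 0
    for op, amount in log[from_offset:]:
        balance += amount if op == "deposit" else -amount
    return balance

print(rebuild_balance(events))   # -> 120
# After a crash, replaying the same log deterministically yields
# the same state; no separate batch layer is needed.
print(rebuild_balance(events))   # -> 120
```

State-point queuing is the refinement of this: periodically checkpoint the state along with the offset it corresponds to, so recovery replays only the suffix of the log.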
Jim Scott
Thursday, March 31 at 4:20pm
Building scalable application platforms is not easy. Enabling servers to communicate with other servers is also not easy. While RPC is often used to manage communication between tiers of a scalable application, some level of sharding of that communication typically occurs. While this can be effective, it brings with it a certain amount of management overhead: every time a server is added or taken away, some type of rebalancing must occur. Another option is to utilize a registry.
The intent of the Zeta Architecture is to support elastic expansion and contraction of services in different tiers of your stack to optimize resource utilization across the data center. Manual sharding of communications between applications is NOT an option. The best way to support communication between these dynamically scalable applications is to communicate via a messaging platform that can easily handle trillions of events per day. After all, if the messaging platform can’t handle the scale, then it will not suffice as a communication channel between applications.
We will walk through an example of data center monitoring to show how this works and cover the benefits of this model. Alternatives will also be discussed, such as using a registry to track which servers are alive and taking requests, along with the pros and cons that come with each.
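The registry alternative mentioned above can be sketched very simply (this is a toy illustration, not ZooKeeper or any particular registry's API; all names here are invented): servers register themselves on startup and deregister on shutdown, and clients look up a live endpoint instead of relying on static sharding, so adding or removing a server requires no rebalancing.

```python
import random

class Registry:
    """Toy service registry: tracks live endpoints per service name."""
    def __init__(self):
        self._services = {}    # service name -> set of live endpoints
    def register(self, service, endpoint):
        self._services.setdefault(service, set()).add(endpoint)
    def deregister(self, service, endpoint):
        self._services.get(service, set()).discard(endpoint)
    def lookup(self, service):
        """Return one live endpoint at random, or None if none are up."""
        live = sorted(self._services.get(service, set()))
        return random.choice(live) if live else None

reg = Registry()
reg.register("api", "10.0.0.1:8080")
reg.register("api", "10.0.0.2:8080")
reg.deregister("api", "10.0.0.1:8080")   # server taken away: no rebalancing
print(reg.lookup("api"))                  # -> 10.0.0.2:8080
```

The pros and cons follow directly from the structure: clients must consult the registry on every connection (or cache and watch for changes), and the registry itself becomes a component that must be kept highly available, which is part of why a scalable messaging platform is an attractive alternative channel.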
Steve Wooledge
Thursday, March 31 at 11:00am
Business doesn’t happen in neatly defined batches. Because of that, there’s substantial advantage to being able to make decisions at the speed required to respond to events in the moment. Companies want new applications that don’t simply report on business results but instead incorporate low-latency processing to have an impact on business as it happens. For example, in fraud detection the ability to analyze data and make a decision with very low latency may let you mobilize quickly enough to shut down fraudulent activities before large losses occur.
New technologies are changing what is possible with event stream processing. These new approaches not only enable you to deal with low latency decisions, they also give you surprising agility to respond to changing conditions. You will not only be able to tolerate change, you can embrace it as a competitive advantage. By using these emerging stream-based technologies, your organization can build software more quickly and more reliably, leaving you free to focus on business goals.
How is this possible? The key lies in being able to isolate services, and isolation of services, in turn, depends on a system that provides durability of event data at high speed and at scale. Specific capabilities in the messaging layer of your system let you process streaming data immediately or when you are ready; the messages are there when you need them.
In this talk, the audience will learn how these new technologies make the approach work and see how these practices can be applied in a variety of settings, including retail, financial services, and telecommunications.
Jack drives understanding and adoption of new applications enabled by data convergence. With over 20 years of enterprise software marketing experience, he has demonstrated success from defining new markets for small companies to increasing sales of new products for large public companies. Jack’s broad experience includes launching and establishing analytic, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider.
Ted Dunning is Chief Application Architect at MapR Technologies and a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he's not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.
Jim drives enterprise architecture and strategy at MapR. Jim is the cofounder of the Chicago Hadoop Users Group and has helped build the Hadoop community in Chicago for the past four years. He has implemented Hadoop at three different companies, supporting a variety of enterprise use cases from managing points of interest for mapping applications, to online transactional processing in advertising, to full data center monitoring and general data processing. Prior to MapR, Jim was SVP of Information Technology and Operations at SPINS, the leading provider of retail consumer insights, analytics reporting, and consulting services for the natural, organic, and specialty products industry. Additionally, he served as lead engineer/architect for dotomi, one of the world's largest and most diversified digital marketing companies. Prior to dotomi, Jim held several architect positions with companies such as aircell, NAVTEQ, Classified Ventures, Innobean, Imagitek, and Dow Chemical, where his work with high-throughput computing was a precursor to more standardized big data concepts like Hadoop.
Steve is Vice President, Product Marketing at MapR, and is responsible for identifying new market opportunities and increasing awareness for MapR technical innovations and solutions for Hadoop. Steve was previously Vice President of Marketing for Teradata Unified Data Architecture, where he drove Big Data strategy and market awareness across the product line, including Apache Hadoop. Steve has also held various roles in product and corporate marketing at Aster Data, Interwoven, and Business Objects, as well as sales and engineering roles at Business Objects and Dow Chemical. When not working, Steve enjoys juggling activities for his 5 kids and sneaking in some cycling or ski trips when he can.