Webinar Q&A Follow-up: Predictive Analytics with Machine Learning and Hadoop

The recent Skytree and MapR webinar “Predictive Analytics with Machine Learning and Hadoop” proved to be highly interactive and engaging. As promised, Nitin and Jin have provided answers to the questions that we were not able to get to during the webinar.

If you missed the webinar, you can watch the replay or view the slides.

Q:  What benefits does Skytree offer the data scientist versus Spark MLlib?

A:  MLlib, like Mahout and other open source ML libraries, is just that: a general-purpose library of methods. Skytree and other ML companies offer more integrated, purpose-built ML systems designed for reliability and scalability.

Q:  Can Skytree Server run in distributed mode under YARN?

A:  Yes. We announced our support for YARN in June this year at the Hadoop Summit.

Q:  What analytic techniques does Skytree support? 

A:  Skytree Server is a general-purpose machine learning platform and supports a broad array of methods and algorithms. For specific details, visit the Skytree website.

Q:  Which Hadoop data formats does Skytree support?  (e.g. Parquet, HBase, compressed data, etc.)

A: Skytree Server is a general-purpose machine learning platform and supports a broad array of data formats. For specific details, visit the Skytree website.

Q:  How can Hadoop be used to implement/run proprietary ML techniques (i.e., those that don't necessarily fit into the standard ones that you've covered more thoroughly)?

A: The value of Hadoop for machine learning comes from parallelizing heavy computation by spreading data on many nodes. As long as you can easily parallelize your ML techniques, you should be able to easily run them on Hadoop. In cases where you cannot easily parallelize, you can run them on one node and leverage features such as NFS, which is available on MapR.
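
To make the idea of "easily parallelized" concrete, here is a minimal sketch in plain Python rather than actual Hadoop or Skytree code. It assumes a toy model (fitting the slope of y = w * x by least squares), and the function names are hypothetical. Each chunk stands in for the data held on one node; the map step computes partial sums locally, and the reduce step combines them.

```python
from functools import reduce


def partition(rows, n_parts):
    """Split rows into n_parts chunks, standing in for data blocks on different nodes."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_partial_sums(chunk):
    """Map step: compute the sufficient statistics for one chunk only."""
    sum_xy = sum(x * y for x, y in chunk)
    sum_xx = sum(x * x for x, y in chunk)
    return sum_xy, sum_xx


def reduce_sums(a, b):
    """Reduce step: partial sums from two chunks simply add together."""
    return a[0] + b[0], a[1] + b[1]


def fit_slope(rows, n_parts=4):
    """Least-squares slope for y = w * x, computed from per-chunk partial sums."""
    partials = [map_partial_sums(chunk) for chunk in partition(rows, n_parts)]
    sum_xy, sum_xx = reduce(reduce_sums, partials, (0.0, 0.0))
    return sum_xy / sum_xx


# Toy data that follows y = 2x exactly.
rows = [(float(x), 2.0 * x) for x in range(1, 101)]
slope = fit_slope(rows)
print(slope)
```

Because the partial sums add up to the same totals no matter how the rows are split, the result does not depend on the number of chunks. A technique whose per-chunk results cannot be combined this simply is the kind that is harder to parallelize.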

Q:  Is Hadoop the right tool to use even in applications where the data being analyzed in a ML framework might not really be "big"?

A: Even when the dataset itself is small, the computation on it can be heavy. In those situations you can still leverage parallelization on Hadoop by distributing your data across many nodes, so that each node works on a smaller slice and finishes much faster than a single node processing the whole dataset.

Q:  As a data scientist who has spent time developing algorithms using machine learning methods, what's a good way to get started on Hadoop or MapR? Is it possible to do it on a simple server?

A:  It is very easy to get started on Hadoop on a single server using the MapR Sandbox. You can download our Hadoop Sandbox here. By using NFS, you will be able to easily import data onto the Sandbox and run your code directly on Hadoop.
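
Because an NFS mount looks like an ordinary directory, importing data can be an ordinary file copy. The sketch below is illustrative only: the helper name, the mount point, and the destination directory are assumptions, so substitute the path where your Sandbox's NFS export is actually mounted.

```python
import shutil
from pathlib import Path


def import_to_sandbox(local_file, nfs_mount, dest_dir):
    """Copy a local file into a directory under an NFS-mounted cluster path.

    nfs_mount is wherever the cluster's NFS export is mounted on this machine
    (an assumption; it varies by setup), and dest_dir is relative to that mount.
    """
    target_dir = Path(nfs_mount) / dest_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the path of the newly created file.
    return Path(shutil.copy(local_file, target_dir))


# Hypothetical usage; the mount path below is an assumption, not a documented default:
# import_to_sandbox("train.csv", "/mapr/my.cluster.com", "user/mapr/data")
```

Once the file has been copied, jobs on the cluster can read it from the corresponding cluster path without a separate upload step.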

Q:  What Hadoop streaming formats does Skytree support?

A: Skytree supports its own proprietary stream-processing engine because enterprise-quality streaming formats didn't exist until recently. In the very near future, in line with our philosophy of supporting open source data enablement methods, we will be migrating our streaming capabilities to Spark Streaming.

Q:  The presentation included a Skytree benchmark running on a single node.  Can Skytree run on many nodes?  Are any Skytree customers doing this today?

A:  Skytree supports and is optimized for distributed computing environments. We are fully integrated with Hadoop and have multiple installations globally running on multi-node clusters.

Q:  Would you feed data from your EDW into MapR or Skytree?

A: Skytree supports data ingestion from MapR FS, but also has the capability to interface directly with EDWs via its JDBC connectors.

Q:  Can you give a slightly more detailed example of an equipment failure case study?

A:  One example involves developing models to predict likely points of failure in energy distribution networks based on sensor data, environmental data, usage data, historical data, load distribution data, etc.

Q:  Are there any references that have info on how to integrate R with Hadoop?

A:  RHadoop from Revolution Analytics allows you to easily integrate R with Hadoop. RHadoop helps you parallelize R code to run in a cluster environment.  

Q:  Can sound be used?

A: Some data used in predictive analysis includes audio/sound. For example, customer calls are recorded and transcribed as part of churn analysis or next logical product analysis. Some insurance companies are experimenting with using cell phone audio (with permission from their customers) to evaluate whether their customer was a passenger or driver in an incident (by analyzing ambient sound) as part of faster claims processing. Many predictive maintenance applications also involve analyzing the sound signature of equipment as part of the totality of data sources.

Q:  Does Skytree plan to support Apache Spark?

A:  Absolutely. Skytree already supports and ships with Spark.

Q:  Are there any use cases in the media and entertainment industry?

A: The media and entertainment industry aggressively uses big data and machine learning techniques to better segment customers, identify next logical products/services, understand emerging trends via social media, perform sentiment analysis, and perform HR analytics to improve recruiting and talent retention, etc. These are just a small sample of use cases.

Q:  How can machine learning be applied to a staffing and resourcing company that has 10,000 consultants on billing and 1000+ recruiting and sales teams who interface with more than 10 job boards to hunt for candidates?

A: Machine learning methods can be used in many parts of HR analytics. For example, machine learning can be used to better understand the best source of leads, analyze a social network chain for outliers or indicators of good/bad candidates, analyze feedback on reviews to identify indicators of likely employee exits (specialized form of churn analysis), identify reactive sentiments to policy changes in order to evaluate impact on recruiting or retention, etc.

Q:  In what way does Skytree ensure that the user specifies the right model?

A: Skytree provides a variety of feedback (both visualization and data-centric) to enable users to evaluate the correct fit of models and parameter settings. Skytree also provides automatic model selection capability based on data characteristics. This capability is in an early access test phase, and is being used by several customers. The capability is planned to be generally available later this year.

Q:  How much does the Skytree license cost?

A:  Skytree licenses are charged on a per-node/per-core basis. Please contact sales@skytree.net for pricing details.

Q:  Can you give a slightly more detailed example of an equipment failure case study?

A: We are working on equipment failure, with predictive/prescriptive maintenance applications in various verticals such as oil & gas, utilities, transportation & logistics, and telecommunications. Typical examples include: predicting failure of parts in trucks to develop a more efficient maintenance schedule or backup route options, predicting failure in energy distribution networks to correct the system before actual failure can occur, and detecting communication network node failure to better manage network traffic.

Q:  Would you feed data from your EDW into MapR or Skytree?

A: Either approach can work, and Skytree has customers using each. The choice depends more on use cases, existing enterprise data platforms, performance requirements, types of data sources, etc.

Q:  How does Skytree help with logistics?  Does it supplement or replace the logistics optimization algorithms?

A:  Organizations using Skytree for logistics do not replace logistics optimization algorithms, but instead complement them. They are looking for specific point improvements from advanced machine learning methods: fusing a larger variety of data, shortening the time to model development and insight, improving accuracy, and addressing new use cases not covered by existing methods.

