BigData with Java Interview Questions

Saptarshi Chatterjee
11 min read · Sep 13, 2019

How would you design a scalable database architecture?

  1. Microservice architecture: For example, in an e-commerce site, Checkout, Search, and Profile are each separate Spring Boot applications with their own databases. When the app submits a buy request, the checkout service checks with the auth service that the cookie is valid and that the supplied userId owns the cookie, and then updates its own DB.
  2. Vertical scaling: Use AWS Ops Automator, or simply increase the RDS instance capacity manually.
  3. Horizontal scaling: RDS MySQL can support up to 5 read replicas and provide transparent load balancing.
  4. Sharding: Partition tables based on keys. Add customer_id to all the tables as an extra shard-key column; a sketch of key-based routing follows this list.
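To make the sharding idea concrete, here is a minimal sketch of key-based routing, assuming 4 shards, a numeric customer_id as the shard key, and illustrative JDBC URLs (the class and values are hypothetical, not a prescribed design):

import java.util.Arrays;
import java.util.List;

// Minimal sketch: route a customer_id (the shard key) to one of 4 shards.
public class ShardRouter {
    private final List<String> shardJdbcUrls = Arrays.asList(
            "jdbc:mysql://shard0/orders",
            "jdbc:mysql://shard1/orders",
            "jdbc:mysql://shard2/orders",
            "jdbc:mysql://shard3/orders");

    // Every table carries the customer_id column, so any query can be routed the same way.
    public String shardFor(long customerId) {
        int shard = (int) (customerId % shardJdbcUrls.size());
        return shardJdbcUrls.get(shard);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter();
        System.out.println(router.shardFor(1021L)); // customer 1021 -> shard1
    }
}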

How HBase stores data internally

Explain CAP theorem

The CAP theorem states that a distributed database cannot provide all three of the following guarantees at the same time:

Consistency — data is the same for every client

Availability — the system is up and does not return an error

Partition tolerance — the system keeps running even if there is a network partition between nodes

What are the shortcomings of HBase?

  • Single point of failure: when only one HMaster is used, there is a risk of failure.
  • No transaction support: HBase has no support for multi-row transactions.
  • No handling of JOINs in the database: JOINs are handled in the MapReduce layer instead of in the database itself.
  • No built-in authentication: there are no permissions or built-in authentication.
  • No SQL support: since there is no SQL layer, there is no query optimizer either.

What is salting in HBase? How is it used when reads are slow because data is not evenly distributed (hot-spotting)?

HBase hot-spotting occurs because of poorly designed row keys. With a bad row key, HBase stores large amounts of data on a single node, and all client traffic is directed to that node while the other nodes sit idle.

Below are some of the techniques used to avoid hot-spotting:

Salting
The salting process is helpful when you have a small, fixed number of row keys that come up over and over again. For example, consider the following four row key values:

machine0001, machine0002, machine0003, machine0004

If you would like to spread these keys across four different regions, you can use the four letters a, b, c, and d as prefixes. The updated values would be:

a-machine0001, b-machine0002, c-machine0003, d-machine0004

Hashing
Hashing uses a hash function to derive the prefix deterministically, instead of assigning it at random.

With a one-way hash function, the row being stored is always “salted” with the same prefix, which spreads the load across region servers while keeping the prefix reproducible for reads.
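To make the salting and hashing ideas concrete, here is a minimal sketch that derives a prefix for the machineNNNN keys above from a one-way hash; the bucket count and the helper class are illustrative assumptions:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: prefix ("salt") an HBase row key with a hash-derived bucket so writes spread across regions.
public class SaltedRowKey {
    private static final int BUCKETS = 4; // assumed number of pre-split regions

    static String saltedKey(String rowKey) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(rowKey.getBytes(StandardCharsets.UTF_8));
        int bucket = Math.abs(digest[0]) % BUCKETS; // one-way hash: the same key always gets the same prefix
        return bucket + "-" + rowKey;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        for (String key : new String[]{"machine0001", "machine0002", "machine0003", "machine0004"}) {
            System.out.println(saltedKey(key)); // "<bucket>-machineNNNN", spread across BUCKETS key ranges
        }
    }
}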

Reversing the Key
A third common technique for preventing hot-spotting is to reverse a fixed-width or numeric row key so that the part that changes the most often is first.

How checkpointing works in Spark Streaming

How to declare a broadcast variable in Spark 2.0

If you want to broadcast an instance of a class named Test, the class should implement java.io.Serializable, and then you do:

import org.apache.spark.broadcast.Broadcast;
import scala.reflect.ClassTag;

ClassTag<Test> classTagTest = scala.reflect.ClassTag$.MODULE$.apply(Test.class);
// assuming spark is the SparkSession; its underlying SparkContext needs a ClassTag for the Java class
Broadcast<Test> broadcastTest = spark.sparkContext().broadcast(new Test(), classTagTest);

How would you share a connection between the driver and executors when using Kafka?

Difference between Kafka and P2P Queue

How does Kafka’s notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe, the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren’t multi-subscriber: once one process reads the data it’s gone. Publish-subscribe allows you to broadcast data to multiple processes but has no way of scaling processing since every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka’s model is that every topic has both these properties — it can scale processing and is also multi-subscriber — there is no need to choose one or the other.

Kafka has stronger ordering guarantees than a traditional messaging system, too.

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of “exclusive consumer” that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism — the partition — within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

Kafka Guarantees

At a high-level Kafka gives the following guarantees:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

More details on these guarantees are given in the design section of the documentation.
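A small producer sketch that relies on these guarantees (the broker address, topic name, and key below are assumed values): records sent with the same key go to the same partition in send order, and acks=all makes the producer wait for the in-sync replicas before a record counts as committed.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, so M1 gets a lower offset than M2
            producer.send(new ProducerRecord<>("orders", "customer-42", "M1"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "M2"));
        }
    }
}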

What is the size of the data, and how long did it take to process it using Kafka?

Remember that Google, as of 2019, gets ~80,000 queries/second, so a 3-node Kafka cluster is good enough for almost any payload.

Write performance benchmark setup: 3 nodes @ 4 GB RAM, 1 CPU, 200 GB disk each, with 512-byte messages.

What are some of the major problems you faced while developing your Spark Application?

Kafka: difference between consumers and consumer groups

If the number of Consumer Group nodes is more than the number of partitions, the excess nodes remain idle. If a consumer node takes multiple partitions or ends up taking multiple partitions on failover, those partitions will appear intermingled, if viewed as a single stream of messages. So a Consumer Group application could get row #100 from partition 3, then row #90 from partition 4, then back to partition 3 for row #101. Nothing in Kafka can guarantee order across partitions, as only messages within a partition are in order.

So, if strict ordering across all messages matters, the safe choice is a single partition, consumed by one member of the consumer group at a time. A minimal consumer-group sketch follows.
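As a sketch of the consumer-group side (the broker address, group id, and topic are assumed values): every instance started with the same group.id splits the topic’s partitions among the group, and ordering is only guaranteed within each partition.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service"); // all instances with this id share the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Order holds per r.partition(); across partitions it can interleave as described above
                    System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}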

What are the serialization methods used in Kafka? (e.g., Avro)

Java Design Patterns

Creational

A) Prototype: Clone an existing object rather than creating a new instance from scratch, which may involve costly operations. The existing object acts as the prototype and carries the object’s state.

abstract class Color implements Cloneable {
    protected String colorName;

    abstract void addColor();

    public Color clone() {
        Color clone = null;
        try {
            clone = (Color) super.clone();
        } catch (CloneNotSupportedException e) {
            // cannot happen: Color implements Cloneable
        }
        return clone;
    }
}

class BlueColor extends Color {
    public BlueColor() {
        this.colorName = "blue";
    }

    @Override
    void addColor() {
        System.out.println("Blue color added");
    }
}

B) Singleton: The singleton pattern restricts the instantiation of a class to one object, e.g. a DB that supports only one connection.

class Single {
    private static Single one;

    private Single() { }

    // Not thread-safe; synchronize this method (or use an enum/holder idiom) for multi-threaded use
    public static Single getSingle() {
        if (one == null) {
            one = new Single();
        }
        return one;
    }
}

Structural

A) Proxy B) Facade C) Adapter …

Behavioral

A) Iterator B) Chain of responsibility C) Observer …

Diamond problem in Java 8 interfaces and how to solve it

Since Java 8 supports default methods in interfaces, it can run into the diamond problem.

interface Poet {
    default void write() {
        System.out.println("Poet's default method");
    }
}

interface Writer {
    default void write() {
        System.out.println("Writer's default method");
    }
}

public class Multitalented implements Poet, Writer {
    public static void main(String args[]) {
        Multitalented john = new Multitalented();
        john.write(); // which write method to call, from Poet or from Writer?
    }
}
// Output: Compile Time Error: class Multitalented inherits unrelated defaults for write() from types Poet and Writer

Solution:

public class Multitalented implements Poet, Writer {
    @Override
    public void write() {
        System.out.println("Writing stories nowadays");
    }

    public static void main(String args[]) {
        Multitalented john = new Multitalented();
        john.write(); // This will call Multitalented#write()
    }
}
Output: Writing stories nowadays

What is a functional interface in Java 8?

A functional interface is an interface that contains only one abstract method. Lambda expressions can be used to represent the instance of a functional interface. A functional interface can have any number of default methods. Runnable, ActionListener, Comparable are functional interfaces. Before Java 8, we had to create anonymous inner class objects or implement these interfaces.

@FunctionalInterface
interface Square {
    int calculate(int x);
}

class Test {
    public static void main(String args[]) {
        int a = 5;
        // lambda expression to define the calculate method
        Square s = (int x) -> x * x;
        // parameter and return type must match the abstract method's prototype
        int ans = s.calculate(a);
        System.out.println(ans);
    }
}

Java 8 Supplier and Consumer interfaces

Supplier<String> i = () -> "Car";
Consumer<String> c = x -> System.out.print(x.toLowerCase());
Consumer<String> d = x -> System.out.print(x.toUpperCase());
c.andThen(d).accept(i.get());
System.out.println();
//Outputs carCAR

Difference between Structured Streaming and DStreams in Spark

DStreams are based purely on Java/Python objects and functions, as opposed to the richer concept of structured tables offered by DataFrames and Datasets in Structured Streaming. Second, the DStream API is purely based on processing time, while Structured Streaming has native support for event-time data. Finally, DStreams can only operate in a micro-batch fashion, whereas Structured Streaming also supports continuous processing.
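As a rough illustration of the Structured Streaming side (the socket source, host, and port are assumed values), the stream is just a DataFrame that Spark keeps updating incrementally:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamDemo {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("structured-stream-demo")
                .master("local[*]")
                .getOrCreate();

        // Lines arriving on a socket are exposed as an unbounded table
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // A normal DataFrame aggregation; Spark maintains it incrementally as new rows arrive
        Dataset<Row> counts = lines.groupBy("value").count();

        StreamingQuery query = counts.writeStream()
                .outputMode("complete")
                .format("console")
                .start();
        query.awaitTermination();
    }
}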

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

SparkContext is the Scala entry point, and JavaSparkContext is its Java wrapper.

SQLContext is the entry point of Spark SQL and can be obtained from a SparkContext. Prior to 2.x, RDD, DataFrame, and Dataset were three different data abstractions with separate entry points. Since Spark 2.x, all three are unified and SparkSession is the single entry point of Spark.
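A minimal sketch of the unified entry point in Java, showing how the older contexts can still be reached from a SparkSession (the app name and master are assumed values):

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class EntryPoints {
    public static void main(String[] args) {
        // Since Spark 2.x, SparkSession is the single entry point
        SparkSession spark = SparkSession.builder()
                .appName("entry-point-demo")
                .master("local[*]")
                .getOrCreate();

        SparkContext sc = spark.sparkContext();                       // underlying Scala SparkContext
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc); // Java-friendly wrapper
        SQLContext sqlContext = spark.sqlContext();                   // legacy Spark SQL entry point, still exposed

        System.out.println("Spark version: " + spark.version());
        spark.stop();
    }
}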

How to debug a Kafka application

Kafkacat is a generic, non-JVM command-line Apache Kafka producer and consumer that is widely used for debugging.

# List all brokers, topics and partitions
kafkacat -L -b kafka
# Transfer messages between topics
kafkacat -C -b kafka -t awesome-topic -e | kafkacat -P -b kafka -t awesome-topic2
# Read specific messages from a partition
kafkacat -C -b kafka -t superduper-topic -o -5 -e -p 5

The last command uses the -o flag, which means “read from this offset”; feeding it -5 means “read 5 messages from the end” (passing 5 would start from offset 5). -e exits when the last message is read. -p 5 tells kafkacat to only read messages from partition 5.

How to debug a Spark application

We set up remote debugging in IntelliJ and attach the debugger to our Spark application:

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
$ /opt/mapr/spark/spark-1.6.1/bin/spark-submit --class com.mapr.test.BasicSparkStringConsumer /mapr/myclust1/user/iandow/my-streaming-app-1.0-jar-with-dependencies.jar /user/iandow/mystream:mytopic

What is Reactive Spring Boot? (Server-Sent Events / Flux)

In the following code, even though the emitter is returned immediately, it will keep sending data until it is explicitly closed.

SSE is a specification adopted by most browsers that allows streaming events unidirectionally at any time. The ‘events’ are just a stream of UTF-8 encoded text data that follows the format defined by the specification. WebSockets offer full-duplex (bi-directional) communication between the server and the client, while SSE uses uni-directional communication.

Also, WebSockets is not an HTTP-based protocol and, unlike SSE, it does not offer error-handling standards.

@GetMapping("/stream-sse-mvc")
public SseEmitter streamSseMvc() {
    SseEmitter emitter = new SseEmitter();
    ExecutorService sseMvcExecutor = Executors.newSingleThreadExecutor();
    sseMvcExecutor.execute(() -> {
        try {
            for (int i = 0; true; i++) {
                SseEventBuilder event = SseEmitter.event()
                        .data("SSE MVC - " + LocalTime.now().toString())
                        .id(String.valueOf(i))
                        .name("sse event - mvc");
                emitter.send(event);
                Thread.sleep(1000);
            }
        } catch (Exception ex) {
            emitter.completeWithError(ex);
        }
    });
    return emitter;
}

How do you manage API contracts in a microservice application?

Using Spring Cloud Contract (<artifactId>spring-cloud-contract-maven-plugin</artifactId>), we can define consumer expectations for a given provider (which can be an HTTP REST service) in the form of a contract (hence the name of the library).

Contract.make {
    request {
        method PUT()
        url '/fraudcheck'
        body([
            "client.id": $(regex('[0-9]{10}')),
            loanAmount: 99999
        ])
        headers {
            contentType('application/json')
        }
    }
    response {
        status OK()
        body([
            fraudCheckStatus: "FRAUD",
            "rejection.reason": "Amount too high"
        ])
        headers {
            contentType applicationJson()
        }
    }
}

Select the 4th highest salary in the company

-- Correlated subquery: the 4th highest salary has exactly 3 salaries above it
SELECT name, salary FROM Employee e1
WHERE (SELECT COUNT(salary) FROM Employee e2 WHERE e2.salary > e1.salary) = 4 - 1;

-- SQL Server style, using TOP
SELECT TOP 1 name, salary
FROM (SELECT TOP 4 name, salary FROM Employee ORDER BY salary DESC) AS temp
ORDER BY salary;

Java Memory model and garbage collection

Minor GC - Happens in the young generation. Surviving objects are copied from Eden into a Survivor space; the two Survivor spaces (S0 and S1) copy live objects back and forth between themselves and compact as they go (a copying garbage collector). Objects that survive enough minor collections are promoted to the old (tenured) generation.

Major GC - Happens in the old generation. The concurrent collector (CMS) marks, remarks, and sweeps mostly concurrently with the application; the parallel collector runs the collection in multiple threads. G1 GC, available as of Java 8, splits the heap into many small regions covering both the young and old generations and collects, in parallel, the regions with the most garbage first.
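One way to observe this split at runtime: the JVM exposes one GarbageCollectorMXBean per collector, typically one for the young generation (minor GC) and one for the old generation (major GC), for example when running with -XX:+UseG1GC. A minimal sketch:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcInfo {
    public static void main(String[] args) {
        // Each bean corresponds to one collector, e.g. "G1 Young Generation" / "G1 Old Generation" under G1
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}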

Difference between PUT and PATCH

The main difference between PUT and PATCH is that a PUT request uses the request URI to supply a complete, modified version of the requested resource, which replaces the original version, whereas a PATCH request supplies a set of instructions to modify the existing resource.
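A minimal Spring MVC sketch of the difference; the UserProfile resource, its fields, and the in-memory store are hypothetical, and error handling is omitted:

import org.springframework.web.bind.annotation.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical resource, for illustration only
class UserProfile {
    public String name;
    public String email;
}

@RestController
@RequestMapping("/users")
class UserProfileController {

    private final Map<Long, UserProfile> store = new ConcurrentHashMap<>();

    // PUT: the body is a complete replacement of the resource at this URI
    @PutMapping("/{id}")
    public UserProfile replace(@PathVariable long id, @RequestBody UserProfile user) {
        store.put(id, user);
        return user;
    }

    // PATCH: the body carries only the fields to change on the existing resource
    @PatchMapping("/{id}")
    public UserProfile update(@PathVariable long id, @RequestBody Map<String, String> changes) {
        UserProfile existing = store.get(id);
        if (changes.containsKey("name"))  existing.name  = changes.get("name");
        if (changes.containsKey("email")) existing.email = changes.get("email");
        return existing;
    }
}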

What is a resource in REST?

A resource in REST is similar to an object in object-oriented programming or an entity in a database. Once a resource is identified, its representation has to be decided using a standard format, so that the server can send the resource in that format and the client can understand it.
